[SERVER-34518] WT Transaction / Timestamp error during non-responsiveness Created: 17/Apr/18  Updated: 16/Oct/19  Resolved: 24/Jul/18

Status: Closed
Project: Core Server
Component/s: Replication, WiredTiger
Affects Version/s: 3.6.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Chad Kreimendahl Assignee: Dmitry Agranat
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL

 Description   

We recently migrated one of our replica sets from 3.4 to 3.6.3. It had been running just fine for about 4 days when we had a rash of slow queries out of nowhere, and the database went from super-slow to unresponsive. During that time, the only indication of problems, other than log data showing slow queries, was the following message (logged by the ton):

Line 181225: 2018-04-16T07:28:10.035-0500 E STORAGE [conn908492] WiredTiger error (22) [1523881690:35506][8828:0x7fbc77746700], WT_SESSION.begin_transaction: read timestamp 5ad496da00000001 older than oldest timestamp 5ad496da00000004: Invalid argument



 Comments   
Comment by Dmitry Agranat [ 22/Jul/18 ]

Hi sallgeud,

As mentioned earlier, after carefully reviewing the last set of uploaded data, we were not able to find any issue with the originally reported "WT Transaction / Timestamp error during non-responsiveness". As such, we will close this case. If you still experience issues with the "WT Transaction / Timestamp error during non-responsiveness", please upload the archived mongod log, the diagnostic.data, and the exact time of the issue, and we will reopen this case.

As for the mentioned issue with how MongoDB clients route read operations, if you suspect this is a bug, please open a separate case and we will be happy to investigate it.

Thanks,
Dima

Comment by Chad Kreimendahl [ 13/Jul/18 ]

Every single query that shows "secondaryPreferred" is wrong. 100% of those queries are at least 2-20x slower than they are in version 3.4.  They are also 100% NOT secondary queries. This is entirely a primary system. Not a single line of our code EVER queries from a secondary. So we should never, ever, ever see "secondaryPreferred".

Comment by Dmitry Agranat [ 10/Jul/18 ]

Hi sallgeud,

After carefully reviewing the last set of logs and diagnostic data, we were not able to find the clues/log lines that you mention. Did we miss anything? Can you please re-check on your end and point us to a timestamp/log line where this issue occurred?

Thanks,
Dima

Comment by Chad Kreimendahl [ 05/Jul/18 ]

Ramon,

I've uploaded an up-to-date set of both the new diagnostic.data directory (in a tar file) and the log files from the past 4 days.  My hope is that this covers what you need.

Comment by Ramon Fernandez Marina [ 18/Jun/18 ]

Yes sallgeud, the secure portal for this ticket is still available.

Regards,
Ramón.

Comment by Chad Kreimendahl [ 18/Jun/18 ]

We'll just have to do the log and diagnostic.data uploads at the same time. Will the same URL work?

Comment by Dmitry Agranat [ 14/Jun/18 ]

Thanks sallgeud for uploading the archived content of $dbpath/diagnostic.data. Could you also upload the mongod log covering that period and provide the exact time (including the time zone) of the non-responsiveness?

Comment by Chad Kreimendahl [ 13/Jun/18 ]

We're definitely still seeing the error. I've started an upload of a tar of that directory now. It's quite large, but will be done in the next hour.

Comment by Dmitry Agranat [ 12/Jun/18 ]

Hi sallgeud,

The log message that you are seeing is by design. During periods of slowness, it is possible for the oldest timestamp to race ahead of the oplog read timestamp. When this error occurs, the oplog read simply retries the operation using an updated read timestamp.
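
For illustration, the retry described above looks roughly like the following sketch against the WiredTiger API (shown here with the Python bindings; MongoDB's actual oplog-read retry lives inside the server, and the "get=oldest" query string and the error handling are assumptions made for this sketch):

    import wiredtiger

    def begin_oplog_read(conn, session, read_ts_hex):
        # Start a read transaction at read_ts_hex. If that timestamp has
        # already fallen behind the oldest timestamp, begin_transaction
        # fails with EINVAL ("read timestamp ... older than oldest
        # timestamp", as in the logged error above).
        try:
            session.begin_transaction('read_timestamp=' + read_ts_hex)
        except wiredtiger.WiredTigerError:
            # A real implementation would check that the failure is the
            # "older than oldest timestamp" case before retrying.
            oldest_hex = conn.query_timestamp('get=oldest')
            session.begin_transaction('read_timestamp=' + oldest_hex)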

To understand the reason for the slowness, we requested the archived content of $dbpath/diagnostic.data, which I could not find under the secure portal we provided. Since the reported error occurred on April 16th, I suspect the content of diagnostic.data has already been overwritten. If this issue is still a problem for you, please provide up-to-date copies of:

  • archived content of $dbpath/diagnostic.data
  • complete mongod log files, ideally covering the time from the upgrade until the present

Thanks,
Dima

Comment by Chad Kreimendahl [ 06/Jun/18 ]

On that note: the only queries that run on the secondaries are things that are run manually.  Yet nearly every single slow query (over 95% at last check) showed that it was being queried as "secondaryPreferred", even though the system logging it is the primary and every single one of the queries is set to be primary.

The slowness is also frequent: it happens between 10 and 100 times per hour, but only in short bursts, such that you'll get a rash of slow queries all within a few seconds... then all is well for a long period of time (minutes).

Comment by bgentry [ 06/Jun/18 ]

Chad asked me to upload the log files to the private portal, which I have done.  After the 3.6 node became primary, you will also see a lot of slow (multi-second) queries with $readPreference: { mode: "secondaryPreferred" }.  We are specifying primaryPreferred (via the C# driver), so we don't understand why secondaryPreferred is being used.  They seem to occur in batches every minute or so.  This did not occur when the 3.4 node was primary.
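
For context, this is roughly how a driver-side read preference is pinned; the sketch below uses the Python driver purely for illustration (the original setup uses the C# driver, whose exact configuration isn't shown here), and the host, replica set, database, and collection names are placeholders:

    from pymongo import MongoClient, ReadPreference

    # Read preference can be set on the connection string ...
    client = MongoClient(
        'mongodb://host1:27017,host2:27017/'
        '?replicaSet=rs0&readPreference=primaryPreferred')

    # ... or per database / collection handle:
    db = client.get_database(
        'appdb', read_preference=ReadPreference.PRIMARY_PREFERRED)

    # With primaryPreferred, reads target the primary whenever it is
    # available, so queries logged with secondaryPreferred would not be
    # expected from this configuration.
    open_orders = db.orders.find({'status': 'open'})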

Comment by Bruce Lucas (Inactive) [ 17/Apr/18 ]

Hi Chad,

So we can investigate, can you please upload the following to this secure private portal for the affected instance:

  • archived content of $dbpath/diagnostic.data
  • complete mongod log files, ideally covering the time from the upgrade until the present

Also, can you please give me a timeline of the events that you describe, as accurately as you can (including the time zone)? This will ensure that we're looking at the right time period in the data.

Thanks,
Bruce
