[SERVER-32226] oldest_timestamp should track the last applied time, during initial sync Created: 08/Dec/17  Updated: 30/Oct/23  Resolved: 11/Dec/17

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: 3.6.0
Fix Version/s: 3.6.1, 3.7.1

Type: Task Priority: Critical - P2
Reporter: Eric Milkie Assignee: Daniel Gottlieb (Inactive)
Resolution: Fixed Votes: 0
Labels: RF36
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File PinnedMaster.png     PNG File PinnedPatch.png     PNG File comparison.png     File server32226.tgz    
Issue Links:
Backports
Related
related to SERVER-32185 Freshly synced secondaries respond to... Closed
related to SERVER-32236 Avoid stalling ftdc when system is st... Closed
is related to SERVER-49006 Only advance oldest timestamp when se... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v3.6
Sprint: Repl 2017-12-18

 Description   

Currently, the oldest_timestamp only tracks the stable timestamp, which in turn tracks applied optimes as they commit. However, applied optimes only become stable timestamp candidates when the node's consistency mode is "Consistent". During initial sync the node's consistency mode is not "Consistent", so we add no new stable timestamp candidates during the oplog replay phase. As a result the oldest_timestamp does not move, which can quickly fill up the cache with timestamp data during the oplog replay phase of initial sync.
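
A minimal, self-contained sketch of the idea behind the fix (toy stand-ins for the server's types; the real change lives in mongod's replication and storage layers, see the commits below): during the oplog replay phase of initial sync, advance the oldest timestamp to the last applied optime after each batch so the storage engine can discard older history.

    #include <cstdint>
    #include <vector>

    using Timestamp = std::uint64_t;      // toy stand-in for mongo::Timestamp
    struct OplogEntry { Timestamp ts; };  // toy stand-in for an oplog entry

    struct StorageEngine {
        Timestamp oldestTimestamp = 0;
        // WiredTiger may discard history older than this point.
        void setOldestTimestamp(Timestamp ts) { oldestTimestamp = ts; }
    };

    // During initial sync the node is not "Consistent", so no stable timestamp
    // candidates are produced and the oldest timestamp would otherwise never
    // move. Instead, advance it directly to the last applied optime.
    void applyBatchDuringInitialSync(StorageEngine& engine,
                                     const std::vector<OplogEntry>& batch) {
        Timestamp lastApplied = 0;
        for (const OplogEntry& entry : batch) {
            // ... apply the entry to the local collections ...
            lastApplied = entry.ts;
        }
        engine.setOldestTimestamp(lastApplied);  // keeps pinned history bounded
    }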



 Comments   
Comment by Githook User [ 11/Dec/17 ]

Author: Daniel Gottlieb <daniel.gottlieb@mongodb.com> (dgottlieb)

Message: SERVER-32226: Have the oldest timestamp track the last applied optime during initial sync.

(cherry picked from commit 1a80d3e2817f758f07fa4de722ad278df4de7ac7)
Branch: v3.6
https://github.com/mongodb/mongo/commit/26caeb66ad2a6733313f1e9c25afb812fd4a284e

Comment by Githook User [ 11/Dec/17 ]

Author: Daniel Gottlieb <daniel.gottlieb@mongodb.com> (dgottlieb)

Message: SERVER-32226: Have the oldest timestamp track the last applied optime during initial sync.
Branch: master
https://github.com/mongodb/mongo/commit/1a80d3e2817f758f07fa4de722ad278df4de7ac7

Comment by Michael Cahill (Inactive) [ 10/Dec/17 ]

daniel.gottlieb, I think we should take your fix for this issue.

In parallel, I'll work up the WiredTiger change I sketched out above and we can decide if it will be feasible. As you noted, care will be required when the oldest_timestamp is set for the first time, which might create more problems than it solves.

Comment by Eric Milkie [ 10/Dec/17 ]

Since reads are disallowed while a node is doing an initial sync, I'm not sure the "readConcernMajority lag" statistic makes sense. If anything, it should read as "infinity" during the initial sync period.

Comment by Bruce Lucas (Inactive) [ 10/Dec/17 ]

I don't have an explanation for the value of "10" after initial sync completes.

I wonder if this somehow relates to the fact that the replica set is now idle and we are writing no-ops at 10-second intervals. We could test this theory by adjusting periodicNoopIntervalSecs to see if the range of timestamps pinned changes.
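
One way to run that test (a hedged example: periodicNoopIntervalSecs is a startup-only server parameter, so it must be passed when mongod is launched; the replica set name and 1-second value here are arbitrary choices for the experiment):

    mongod --replSet rs0 --setParameter periodicNoopIntervalSecs=1

If the range of timestamps pinned then hovers near 1 second instead of 10, that would support the no-op writer theory.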

Comment by Bruce Lucas (Inactive) [ 10/Dec/17 ]

  • As expected, and as you noted, the range of timestamps pinned is now well behaved.
  • The lookaside score is now well behaved as well, and this was one of the indicators of cache problems during initial sync. I'm not sure why this differs in the patch, though, as the amount of data in cache is about the same on both master and the patch.
  • However, readConcernMajority lag (the difference between the readConcernMajority and applied optimes) now grows during the oplog replay phase, whereas previously, aside from an initial very large spike, it was 0. We should check the FTDC from the primary and the other secondary to make sure they aren't seeing the same thing: if they are, that would be problematic with respect to cache and would need to be fixed; if not, this is in effect a reporting error for the new secondary, and it would be nice to fix if possible.
  • Our reported replica set lag (the difference between replSetGetStatus optimeDate on the primary vs. the secondary) also grows during this phase, whereas in 3.4 (I believe; I can check again) we reported decreasing lag as we applied ops and caught up. Since this provides an indicator of progress, it would be nice to see if this could be fixed.

Comment by Daniel Gottlieb (Inactive) [ 10/Dec/17 ]

I've attached screenshots of FTDC data demonstrating a workload against master and against a patch I put together (not michael.cahill's idea, but simply advancing the oldest timestamp in the oplog application phase). As per the previous comments, we may not go with this patch; I just wanted to put the numbers out there to ensure there's consensus on what we're fixing.

Against master we can see transactions being pinned shoot up right when documents are inserted into the oplog (the beginning of the oplog application phase). This screenshot was taken from FTDC data before the initial sync completed; my goal was to pipeline some of my work. The attached FTDC covers the full timespan.

Against the patch we can see the transactions being pinned stay steady while documents are inserted into the oplog. I don't have an explanation for the value of "10" after initial sync completes. I also don't know why I chose "filtered". I believe those details are inconsequential.

This wasn't a real perf test (it used non-optimized debug builds), and in both cases the oplog application phase took roughly the same time: 3m30s for master and 3m15s for the patch. Also, this was an insert-only workload without a constrained WT cache. bruce.lucas, I've attached the full FTDC data from both master and the patch. What we have is a pretty educated guess as to what's going on; I just want to make sure you aren't seeing additional data points in FTDC that would suggest advancing the oldest timestamp would not fix the initial sync problem we're seeing.

Comment by Daniel Gottlieb (Inactive) [ 10/Dec/17 ]

michael.cahill, I hadn't thought of the problem that way, but your suggestion would be much easier. One assumption we'd have to vet is that replication's initial call to set the oldest timestamp coming out of initial sync is synchronous with processing future batches. Otherwise the secondary may silently throw away history that is necessary for a consistent read of the data.

On a similar note, a (future) side effect concerns recover-to-a-timestamp: we'd have to make sure the oldest timestamp is set before oplog recovery happens on startup (sketched below).
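
A minimal sketch of that ordering constraint (illustrative function names; this is not the real startup sequence):

    #include <cstdint>

    using Timestamp = std::uint64_t;

    void setOldestTimestamp(Timestamp) { /* tell the storage engine */ }
    void replayOplogFrom(Timestamp)    { /* hypothetical oplog recovery step */ }

    // For recover-to-a-timestamp, the oldest timestamp must be established
    // before startup oplog recovery replays entries, so the storage engine
    // still has the history those replays (and subsequent reads) depend on.
    void startupRecovery(Timestamp stableTimestamp) {
        setOldestTimestamp(stableTimestamp);  // must happen first
        replayOplogFrom(stableTimestamp);     // then replay the oplog
    }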

Comment by Michael Cahill (Inactive) [ 10/Dec/17 ]

daniel.gottlieb, it would be a relatively simple change to WiredTiger to have it not maintain history if no oldest_timestamp is set, and to provide a way to clear the oldest_timestamp.

Would it be easier to say that we only set an oldest timestamp in some finite set of cases (e.g., when the repl state is PRIMARY or SECONDARY) than to explicitly track optimes as they are applied?
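
A self-contained sketch of that alternative (toy types; this is the proposal as sketched here, not shipped code): leave the oldest timestamp unset outside PRIMARY/SECONDARY, and with the proposed WiredTiger change an unset timestamp would mean "maintain no history" rather than "pin all history".

    #include <cstdint>
    #include <optional>

    using Timestamp = std::uint64_t;
    enum class MemberState { STARTUP2, PRIMARY, SECONDARY /* ... */ };

    struct StorageEngine {
        // Unset means "maintain no history" under the proposed WT change.
        std::optional<Timestamp> oldestTimestamp;
        void setOldestTimestamp(Timestamp ts) { oldestTimestamp = ts; }
        void clearOldestTimestamp() { oldestTimestamp.reset(); }
    };

    // Gate oldest-timestamp updates on the member state instead of tracking
    // every optime as it is applied.
    void onStableTimestampCandidate(StorageEngine& engine,
                                    MemberState state,
                                    Timestamp candidate) {
        if (state == MemberState::PRIMARY || state == MemberState::SECONDARY) {
            engine.setOldestTimestamp(candidate);
        } else {
            engine.clearOldestTimestamp();  // e.g. during initial sync
        }
    }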

Comment by Bruce Lucas (Inactive) [ 08/Dec/17 ]

This has been observed in some cases to cause initial sync to get permanently stuck with a full cache.
