[SERVER-32226] oldest_timestamp should track the last applied time, during initial sync Created: 08/Dec/17 Updated: 30/Oct/23 Resolved: 11/Dec/17
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Storage |
| Affects Version/s: | 3.6.0 |
| Fix Version/s: | 3.6.1, 3.7.1 |
| Type: | Task | Priority: | Critical - P2 |
| Reporter: | Eric Milkie | Assignee: | Daniel Gottlieb (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | RF36 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v3.6 |
| Sprint: | Repl 2017-12-18 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
Currently, the oldest_timestamp only tracks the stable timestamp, which tracks applied optimes as they commit. However, applied optimes become stable timestamp candidates only when the node's consistency mode is "Consistent". During initial sync the node's consistency mode is not "Consistent", so we add no new stable timestamp candidates during the oplog replay phase. As a result, the oldest_timestamp does not advance, which can quickly fill the cache with timestamped data during the oplog replay phase of initial sync.
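For illustration, here is a minimal C++ sketch of the shape of fix discussed in the comments below: advance the oldest timestamp as each batch is applied during oplog replay, since no external reads are served during initial sync. All names here (StorageEngine, setOldestTimestamp, onInitialSyncBatchApplied) are hypothetical stand-ins, not the server's actual API.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the server's types; the real names and
// signatures in the MongoDB code base differ.
struct Timestamp {
    std::uint32_t secs = 0;
    std::uint32_t inc = 0;
};

class StorageEngine {
public:
    // Allow the engine to discard history older than `ts`.
    void setOldestTimestamp(Timestamp ts) {
        _oldest = ts;
    }

private:
    Timestamp _oldest;
};

// Called once per applied batch during the oplog replay phase of
// initial sync. No external reads are served during initial sync, so
// no history older than the last applied optime needs to be retained;
// advancing the oldest timestamp lets the cache evict old versions.
void onInitialSyncBatchApplied(StorageEngine* engine, Timestamp lastApplied) {
    engine->setOldestTimestamp(lastApplied);
}
```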
| Comments |
| Comment by Githook User [ 11/Dec/17 ] |
Author: Daniel Gottlieb <daniel.gottlieb@mongodb.com> (dgottlieb)
Message: (cherry picked from commit 1a80d3e2817f758f07fa4de722ad278df4de7ac7)
| Comment by Githook User [ 11/Dec/17 ] |
Author: Daniel Gottlieb <daniel.gottlieb@mongodb.com> (dgottlieb)
Message:
| Comment by Michael Cahill (Inactive) [ 10/Dec/17 ] |
daniel.gottlieb, I think we should take your fix for this issue. In parallel, I'll work up the WiredTiger change I sketched out above, and we can decide whether it is feasible. As you noted, care will be required when the oldest_timestamp is set for the first time, and that change might create more problems than it solves.
| Comment by Eric Milkie [ 10/Dec/17 ] |
Since reads are disallowed while a node is doing an initial sync, I'm not sure the "readConcernMajority lag" statistic makes sense there. If anything, it should read as "infinity" during the initial sync period.
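A hedged sketch of the special case Eric describes, assuming a hypothetical lag computation (the actual serverStatus metric code differs):

```cpp
#include <limits>

// Hypothetical sketch: report the "readConcernMajority lag" statistic
// as infinite while the node is in initial sync, since majority reads
// are disallowed and no finite value is meaningful there.
double readConcernMajorityLagSecs(bool inInitialSync,
                                  double lastAppliedWallTimeSecs,
                                  double majorityCommittedWallTimeSecs) {
    if (inInitialSync) {
        return std::numeric_limits<double>::infinity();
    }
    return lastAppliedWallTimeSecs - majorityCommittedWallTimeSecs;
}
```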
| Comment by Bruce Lucas (Inactive) [ 10/Dec/17 ] |
I wonder if this somehow relates to the fact that the replica set is now idle and we are writing no-ops at 10-second intervals. We could test this theory by adjusting periodicNoopIntervalSecs to see whether the range of pinned timestamps changes.
| Comment by Bruce Lucas (Inactive) [ 10/Dec/17 ] |
| Comment by Daniel Gottlieb (Inactive) [ 10/Dec/17 ] |
I've attached screenshots of FTDC data showing a workload against master and against a patch I put together (not michael.cahill's idea, but simply advancing the oldest timestamp during the oplog application phase). As per the previous comments, we may not go with this patch; I just wanted to put the numbers out there to ensure there's consensus on what we're fixing.

Against master we can see transactions pinned shoot up right when documents are inserted into the oplog (the beginning of the oplog application phase). This screenshot was taken from FTDC data before the initial sync completed; my goal was to pipeline some of my work. The attached FTDC covers the full timespan.

Against the patch, the number of pinned transactions stays steady while documents are inserted into the oplog. I don't have an explanation for the value of "10" after initial sync completes, and I also don't know why I chose "filtered"; I believe those details are inconsequential.

This wasn't a real perf test (it used non-optimized debug builds), and in both cases the oplog application phase took roughly the same time: 3m30s for master, 3m15s for the patch. Also, this was an insert-only workload without a constrained WT cache.

bruce.lucas, I've attached the full FTDC data from master and the patch. What we have is a pretty educated guess as to what's going on; I just want to make sure you're not seeing additional data points in FTDC that would suggest advancing the oldest timestamp would not fix the initial sync problem we're seeing.
| Comment by Daniel Gottlieb (Inactive) [ 10/Dec/17 ] |
michael.cahill, I hadn't thought of the problem that way, but your suggestion would be much easier. One assumption we'd have to vet is that replication's initial call to set the oldest timestamp coming out of initial sync is synchronous with processing future batches; otherwise the secondary may silently throw away history that is necessary for a consistent read of the data. On a similar note, a (future) side effect concerns recover-to-a-timestamp: we'd have to make sure the oldest timestamp is set before oplog recovery happens on startup.
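A sketch of the startup-ordering constraint in that last sentence, under assumed names (startupRecovery, recoverOplog, and setOldestTimestamp are illustrative, not the actual recovery code):

```cpp
struct Timestamp {
    unsigned long long value = 0;
};

class StorageEngine {
public:
    void setOldestTimestamp(Timestamp ts) { _oldest = ts; }

private:
    Timestamp _oldest{};
};

// Stub for the oplog-replay step of startup recovery.
void recoverOplog(StorageEngine* /*engine*/, Timestamp /*from*/) {}

// The oldest timestamp must be established *before* oplog recovery
// replays entries on top of the checkpoint; otherwise the engine may
// already have discarded history needed for a consistent read.
void startupRecovery(StorageEngine* engine, Timestamp checkpointTs) {
    engine->setOldestTimestamp(checkpointTs);  // pin history first
    recoverOplog(engine, checkpointTs);        // then replay the oplog
}
```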
| Comment by Michael Cahill (Inactive) [ 10/Dec/17 ] |
daniel.gottlieb, it would be a relatively simple change to WiredTiger to have it not maintain history if no oldest_timestamp is set, and to provide a way to clear the oldest_timestamp. Would it be easier to say that we only set an oldest timestamp in some finite set of cases (e.g., when the repl state is PRIMARY or SECONDARY) than to explicitly track optimes as they are applied?
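A sketch of this alternative, with the proposed clearing operation modeled as a hypothetical clearOldestTimestamp() (per the comment, WiredTiger would need to be changed to offer this; the other names are illustrative as well):

```cpp
enum class MemberState { kStartup, kInitialSync, kPrimary, kSecondary, kRollback };

class StorageEngine {
public:
    void setOldestTimestamp(unsigned long long ts) {
        _oldest = ts;
        _hasOldest = true;
    }
    // Proposed WiredTiger addition: stop maintaining history entirely.
    void clearOldestTimestamp() {
        _hasOldest = false;
    }

private:
    unsigned long long _oldest = 0;
    bool _hasOldest = false;
};

// Maintain an oldest timestamp only in states that serve reads; in all
// other states (e.g. initial sync) let the engine drop history rather
// than pinning it.
void onAppliedOpTime(StorageEngine* engine, MemberState state, unsigned long long opTimeTs) {
    switch (state) {
        case MemberState::kPrimary:
        case MemberState::kSecondary:
            engine->setOldestTimestamp(opTimeTs);
            break;
        default:
            engine->clearOldestTimestamp();
            break;
    }
}
```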
| Comment by Bruce Lucas (Inactive) [ 08/Dec/17 ] |
This has been observed in some cases to cause initial sync to get permanently stuck with a full cache. |