[SERVER-39608] Initial syncer oplog fetcher should skip first doc on retries Created: 15/Feb/19 Updated: 29/Oct/23 Resolved: 10/Apr/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.11 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Judah Schvimer | Assignee: | Vesselina Ratcheva (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | prepare_initial_sync | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||
| Steps To Reproduce: | Applying this small patch on 29d91ec77892f8b335bcc07ccf96e3240981ab05 reproduces the issue.
|
||||||||||||||||||||
| Sprint: | Repl 2019-03-25, Repl 2019-04-08, Repl 2019-04-22 | ||||||||||||||||||||
| Participants: | |||||||||||||||||||||
| Linked BF Score: | 59 | ||||||||||||||||||||
| Description |
|
https://github.com/mongodb/mongo/commit/4f858c52b05ecc49d2ae19bbaf59fc0aad445b7e started enqueuing the first document the OplogFetcher returned instead of skipping it. When the OplogFetcher retries its oplog query, if it has enqueued any documents (so not if it's retrying the very first query before enqueuing anything successfully), then this will cause it to duplicate a document and hit an invariant. |
| Comments |
| Comment by Luke Chen [ 11/Apr/19 ] |
|
Fixing up the fixVersion as this ticket was not included as part of 4.1.10 release. |
| Comment by Githook User [ 10/Apr/19 ] |
|
Author: {'email': 'vesselina.ratcheva@10gen.com', 'name': 'Vesselina Ratcheva', 'username': 'vessy-mongodb'}Message: |
| Comment by Vesselina Ratcheva (Inactive) [ 02/Apr/19 ] |
|
After discussing with judah.schvimer, we decided that making decisions based on the information from the AbstractOplogFetcher was not a good idea. A big problem is that it does not give us enough information about where in the batch we failed - in particular, it cannot tell us whether we failed midway through a batch or failed to fetch anything in the first place. This means it won't solve the bug we're after. When fetching, we would not be able to distinguish failing right after that one entry from failing before we even get to it - the lastFetched would be the same in both cases. It won't have moved from the value given in initial sync. This is where we are in danger of duplicating a document as we don't know if we enqueued it or not. (The problem arises for the fact that we set lastFetched to a value that we don't always have (we in the prepare case), but we cannot avoid doing so, which is why we needed the enqueue option). The solution would be augment the OplogFetcher in the same place descibed above, except the new factor this time would be whether we have enqueued anything at all with this OplogFetcher so far. This would require the addition of a lifetime-scope boolean in the OplogFetcher which will be set after at least one document has been enqueued, here. If that's set, we will skip even with StartingPoint::kEnqueueFirstDoc. We wouldn't need to know anything about what restarts happened or how. The flag would be enough to tell us whether we've made any progress. |
| Comment by Vesselina Ratcheva (Inactive) [ 01/Apr/19 ] |
|
This is a bit tricky, since the place where we can make decisions about this (here) is different from the place where we detect restarts (here). OplogFetcher is a subclass of AbstractOplogFetcher. The cleanest solution I've found so far is the following: |