[SERVER-35952] Secondaries can fall off the oplog even if necessary oplog entries exist in cluster Created: 03/Jul/18 Updated: 27/Oct/23 Resolved: 29/Jul/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | James Kovacs | Assignee: | Backlog - Replication Team |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: | Replication |
| Operating System: | ALL |
| Steps To Reproduce: | Consider a replica set chaining as follows: P -> S1 -> S2. Consider 4 points in time: t1 -> t2 -> t3 -> t4. |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
When evaluating sync sources, we only consider current state and lag, not which oplog entries each candidate has. This can lead to situations where a chained secondary falls off the oplog when switching sync sources (due to sync source re-evaluation) even though the necessary oplog entries exist in the replica set as a whole. |
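To make the gap concrete, here is a minimal, hypothetical sketch of lag-only selection (the type and function names are illustrative, not the actual TopologyCoordinator code): the chooser ranks candidates purely by how far ahead they are and never asks whether a candidate's oldest retained oplog entry still covers the chooser's last applied optime, which is exactly the information that would prevent falling off the oplog.

```cpp
#include <optional>
#include <vector>

// Illustrative member state, not the real replication types.
struct Candidate {
    int id;
    long long lastAppliedOpTime;  // newest entry the candidate has applied
    long long oldestOpTime;       // oldest entry still retained in its oplog
    bool up;
};

// Lag-only selection: pick the most up-to-date reachable candidate that is ahead of us.
std::optional<int> chooseSyncSourceByLag(const std::vector<Candidate>& members,
                                         long long myLastApplied) {
    std::optional<int> best;
    long long bestOpTime = myLastApplied;
    for (const auto& m : members) {
        if (!m.up || m.lastAppliedOpTime <= bestOpTime)
            continue;
        // Missing check: nothing requires m.oldestOpTime <= myLastApplied, i.e. that
        // the candidate can still serve the entries this node has yet to apply. If the
        // only candidates that pass the lag test have already truncated those entries,
        // the node goes too stale even though another member still retains them.
        best = m.id;
        bestOpTime = m.lastAppliedOpTime;
    }
    return best;
}
```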
| Comments |
| Comment by James Kovacs [ 03/Jul/18 ] |
|
It wasn't clear from the log entries that the node was actually using that lagging secondary as a sync source in between the re-evaluation attempts; it appeared to be cycling through sync sources repeatedly until it fell off the oplog. What actually happened, once I cross-referenced the logs from the other members, is that they were all looping through the candidates trying to find a better sync source, selecting the lagging secondary (because they were too stale to sync from the primary directly), and replicating from it until the next re-evaluation attempt. When the lagging sync source fell off the oplog, all secondaries in the remote DC fell off together. I don't think there is anything better we could do in such a situation. We can close this as "By Design". |
| Comment by Eric Milkie [ 03/Jul/18 ] |
|
The part that Judah wrote is true, but we actually make two passes through the candidate list: one with the restrictions and a second one without them, in case no candidate came out of the first pass. See the comment at https://github.com/mongodb/mongo/blob/57d7938c49da06122d4d43054ff89e1881d0209f/src/mongo/db/repl/topology_coordinator.cpp#L317 |
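For reference, a hedged sketch of that two-pass shape (the names and the particular restriction being relaxed are illustrative, not the real topology_coordinator.cpp logic): hard requirements apply on both passes, while the softer restrictions are enforced only on the first pass and dropped if that pass produces no candidate.

```cpp
#include <optional>
#include <vector>

// Illustrative member state, not the real replication types.
struct Member {
    int id;
    bool up;
    long long lastAppliedOpTime;
    long long pingMillis;
};

std::optional<int> chooseSyncSource(const std::vector<Member>& members,
                                    long long myLastApplied,
                                    long long maxAcceptablePingMillis) {
    for (int pass = 0; pass < 2; ++pass) {
        std::optional<int> best;
        long long bestOpTime = myLastApplied;
        for (const auto& m : members) {
            // Hard requirements, enforced on both passes.
            if (!m.up || m.lastAppliedOpTime <= myLastApplied)
                continue;
            // Soft restriction (ping time here, purely as an example), first pass only.
            if (pass == 0 && m.pingMillis > maxAcceptablePingMillis)
                continue;
            if (m.lastAppliedOpTime > bestOpTime) {
                best = m.id;
                bestOpTime = m.lastAppliedOpTime;
            }
        }
        if (best)
            return best;  // only fall through to the relaxed pass if nothing qualified
    }
    return std::nullopt;
}
```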
| Comment by James Kovacs [ 03/Jul/18 ] |
|
From Sync Source Selection in Replication Internals:
|
| Comment by Eric Milkie [ 03/Jul/18 ] |
|
I'm confused about step 6 in your repro steps. I didn't think we had code logic that did that? Certainly, if the node thinks it is lagged by more than 30 seconds, it will rescan the sync source candidates, but I didn't think we blacklisted any candidates due to their perceived lag. |
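A minimal sketch of the behavior described here, assuming the 30-second figure from the comment above and purely illustrative names (this is not the actual server code): the lag check only decides when to rescan the candidate list; nothing in it blacklists a candidate for being perceived as lagging, so the next scan can legitimately select the same lagging member again if it is the only viable source.

```cpp
// Illustrative threshold; the real parameter name and enforcement point may differ.
constexpr long long kMaxSyncSourceLagSecs = 30;

// Returns true if this node should drop its current sync source and rescan candidates.
bool shouldRescanSyncSources(long long myLastAppliedSecs,
                             long long freshestKnownOpTimeSecs) {
    // If the node believes it has fallen more than 30 seconds behind the freshest
    // member it has heartbeat data for, it re-runs sync source selection. Candidates
    // are not excluded from that selection because of their own lag.
    return freshestKnownOpTimeSecs - myLastAppliedSecs > kMaxSyncSourceLagSecs;
}
```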