[SERVER-29215] Coordinate oplog truncate point with checkpoint timestamp Created: 15/May/17 Updated: 06/Dec/22 Resolved: 01/Aug/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Alexander Gorrod | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Storage Execution |
| Backwards Compatibility: | Fully Compatible |
| Participants: | |
| Description |
|
Once MongoDB switches to journaling only the oplog, it will be necessary to ensure that the oplog is not truncated past the timestamp of the last completed checkpoint. MongoDB should track the timestamp of the last completed checkpoint and ensure that oplog truncation does not remove any content newer than that point in time. |
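A minimal sketch of the coordination being described, assuming a simple integer timestamp and hypothetical names (OplogTruncateState, safeTruncatePoint) rather than actual MongoDB code: the truncate point is clamped so truncation never removes entries newer than the last completed checkpoint, since those entries are needed to roll forward during recovery.

```cpp
#include <algorithm>
#include <cstdint>

// Stand-in timestamp type; not the actual MongoDB Timestamp class.
using Timestamp = std::uint64_t;

struct OplogTruncateState {
    Timestamp lastCheckpointTimestamp = 0;  // advanced after each completed checkpoint
    Timestamp requestedTruncatePoint = 0;   // where size-based truncation wants to cut to
};

// Clamp the truncate point so nothing newer than the last completed
// checkpoint is removed; entries after the checkpoint are still needed
// to roll the data files forward during recovery.
Timestamp safeTruncatePoint(const OplogTruncateState& state) {
    return std::min(state.requestedTruncatePoint, state.lastCheckpointTimestamp);
}
```

The essential property is only the clamp itself; how the checkpoint timestamp is tracked and surfaced is up to the storage layer.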
| Comments |
| Comment by Daniel Gottlieb (Inactive) [ 01/Aug/17 ] |
|
The cases this ticket is addressing are being tracked in |
| Comment by Spencer Brody (Inactive) [ 01/Aug/17 ] |
|
In the case of a single node falling off the back of its own oplog, there is actually still a way for that node to recover: it could completely clear its oplog and then restart replication starting at the checkpointTimestamp. The tricky part is that doing this automatically would probably not be acceptable, since it can cause the node to temporarily lose writes it had previously acknowledged (though that is also a problem if the node needs to resync). |
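A rough sketch of the recovery path described above, with hypothetical helper names (truncateEntireOplog, resumeReplicationFrom) standing in for the real storage and replication operations; this is illustrative only, not MongoDB's actual recovery code.

```cpp
#include <cstdint>
#include <iostream>

// Stand-in timestamp type; not the actual MongoDB Timestamp class.
using Timestamp = std::uint64_t;

// Hypothetical stubs so the sketch compiles; the real operations would
// live in the storage and replication layers.
void truncateEntireOplog() {
    std::cout << "dropping all local oplog entries\n";
}
void resumeReplicationFrom(Timestamp ts) {
    std::cout << "fetching oplog from a sync source starting at " << ts << "\n";
}

// Recovery path for a node whose oldest remaining oplog entry is newer
// than its checkpointTimestamp: the local oplog cannot be used to roll
// forward, so discard it and re-replicate from the checkpoint instead.
// Writes acknowledged after the checkpoint may be temporarily lost until
// they are re-replicated from the rest of the set.
void recoverFromTruncatedOplog(Timestamp checkpointTimestamp) {
    truncateEntireOplog();
    resumeReplicationFrom(checkpointTimestamp);
}
```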
| Comment by Spencer Brody (Inactive) [ 01/Aug/17 ] |
|
This ticket is specifically about the case where the checkpointTimestamp on a given node is behind the oldest oplog entry on that same node. If that node needs to go into replication recovery, either because it was restarted or because it entered ROLLBACK, it will be unable to recover, as it won't have the oplog entries it needs to roll forward. The node will be stuck and will need to be resynced.

For the checkpointTimestamp on a node to fall behind the back of its own oplog, the replication majority commit point must also be behind the back of that same oplog. A majority of nodes falling off the back of the primary's oplog is already a problem today: it can result in large volumes of data being rolled back, or in a majority of the set needing to be resynced, with a corresponding loss of write availability. I don't believe the primary falling off the back of its own oplog is significantly worse or meaningfully exacerbates that problem - if that node crashed it would wind up trying to roll back all the data since the checkpointTimestamp anyway, which, as Dan points out, is likely more data than the old rollback algorithm could have rolled back. So I agree with Dan that this ticket shouldn't be considered a required part of this epic, since the project doesn't make the problems that already exist in this situation meaningfully worse.

That said, I still think we should consider growing the oplog whenever the replication commit point falls behind the point it would otherwise be truncated to, but that should be treated as an independent improvement unrelated to the "Recover to a timestamp" project. I filed |
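A minimal sketch of that independent improvement, again assuming a simple integer timestamp and a hypothetical effectiveTruncatePoint helper (not actual MongoDB code): size-based truncation is deferred, letting the oplog grow past its configured size, whenever the majority commit point is still behind the point truncation would otherwise cut to.

```cpp
#include <algorithm>
#include <cstdint>

// Stand-in timestamp type; not the actual MongoDB Timestamp class.
using Timestamp = std::uint64_t;

// Defer size-based truncation while the replication majority commit point
// lags behind it: the oplog is allowed to grow rather than discarding
// entries a lagging majority may still need.
Timestamp effectiveTruncatePoint(Timestamp sizeBasedTruncatePoint,
                                 Timestamp majorityCommitPoint) {
    return std::min(sizeBasedTruncatePoint, majorityCommitPoint);
}
```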
| Comment by Daniel Gottlieb (Inactive) [ 26/Jul/17 ] |
|
This ticket is about growing the oplog beyond its prescribed size, and it's unclear how much leeway we really have from a behavior-change standpoint. There is a risk that, between taking checkpoints less often on a lagged replica set and needing all of the oplog from the stable timestamp to successfully recover, a replication rollback may introduce more cases of nodes falling off the oplog.

I'll offer an argument that leaving oplog truncation behavior as-is won't introduce many meaningful new cases of failing to have enough oplog. The most common replica set configuration has 3 nodes. If the stable timestamp is lagging far behind, then both secondaries are slow to replicate. Suppose an election appoints one of those secondaries as the new primary. The former primary, once it can reconnect with its peers, must now roll back. If the new recover-to-stable-timestamp algorithm has truncated too much oplog off the old primary, I postulate the 3.4 rollback code would also find it cannot unwind the oplog far enough to recover, so both cases ultimately require a resync.

I don't think this argument holds as well for replica sets with 4+ nodes where the primary and at least one other node (but still a minority) are mostly keeping up while a majority is far behind. An election that appoints a relatively up-to-date node as primary would mean a small rollback in 3.4 for the former primary, but a large and perhaps impossible rollback (roll-forward) with the recover-to-stable-timestamp algorithm if this oplog truncation change is not instituted. |
| Comment by Alexander Gorrod [ 15/May/17 ] |
|
It's possible that this isn't necessary, but the question of when it is safe to truncate the oplog deserves thought once the other work this ticket depends on has been implemented. |