[SERVER-55766] Introduce an optimized "for restore" startup replication recovery mechanism
Created: 02/Apr/21  Updated: 29/Oct/23  Resolved: 03/May/21
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.2.15, 4.4.7, 5.0.0-rc0 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Judah Schvimer | Assignee: | Matthew Russotto |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.4, v4.2 |
| Sprint: | Repl 2021-05-03, Repl 2021-05-17 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
After a restore, users generally don't need to roll back or do point-in-time reads earlier than the top of the oplog. Replication recovery after a restore can also be very long, and the stable/oldest timestamp cannot advance during replication recovery. This isn't great even with durable history, and on 4.2, which predates durable history, it can lead to very poor performance. We should provide a startup parameter that, when configured, applies oplog entries either:

1. without timestamps, so the writes are not tied to durable history, or
2. with timestamps, while advancing the stable timestamp as entries are applied so the storage engine can discard history as it goes.
We may have to set the initial data timestamp at the end of recovery to prevent rollbacks or reads from before that point. We also need to consider what happens if the node crashes halfway through recovery, and make sure data is not corrupted in that case. This should only be supported and used in Atlas. Note that if a rollback to a point before the end of recovery were ever necessary, that rollback would fail unrecoverably; however, if the restore is used to seed a new replica set, no node in that set is expected to roll back to a point before the last seeded oplog entry. A sketch of the intended recovery flow follows below. Credit to lingzhi.deng for this idea. |
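For illustration, here is a minimal, self-contained C++ sketch of the "for restore" flow described above. The parameter name matches the `startupRecoveryForRestore` server parameter that shipped in the fix versions of this ticket, but the types and helpers below are stand-ins so the sketch compiles on its own; this is not the code that landed.

```cpp
// Minimal sketch (not the landed implementation) of "for restore" startup
// recovery. Timestamp, OplogEntry, and StorageEngine are stand-ins for the
// corresponding MongoDB internals.
#include <cstdint>
#include <vector>

struct Timestamp {
    std::uint64_t secs = 0;
    std::uint64_t inc = 0;
    // Sentinel initial-data timestamp that permits unstable checkpoints,
    // mirroring Timestamp::kAllowUnstableCheckpointsSentinel in the server.
    static const Timestamp kAllowUnstableCheckpointsSentinel;
};
const Timestamp Timestamp::kAllowUnstableCheckpointsSentinel{0, 1};

struct OplogEntry {
    Timestamp ts;
    // ... operation payload elided ...
};

struct StorageEngine {
    void setInitialDataTimestamp(Timestamp) { /* forwarded to the engine */ }
    void applyEntry(const OplogEntry&) { /* apply one oplog entry */ }
};

// Stand-in for the startup parameter this ticket proposes.
bool startupRecoveryForRestore = true;

void recoverFromOplog(StorageEngine& engine, const std::vector<OplogEntry>& oplog) {
    if (startupRecoveryForRestore) {
        // Allow periodic unstable checkpoints so the engine can evict cache
        // and discard history during a long recovery.
        engine.setInitialDataTimestamp(Timestamp::kAllowUnstableCheckpointsSentinel);
    }

    for (const auto& entry : oplog) {
        engine.applyEntry(entry);
    }

    if (startupRecoveryForRestore && !oplog.empty()) {
        // Pin the initial data timestamp to the end of recovery so the node
        // can no longer roll back or serve reads from before this point.
        engine.setInitialDataTimestamp(oplog.back().ts);
    }
}
```

In practice this would be enabled at startup (e.g. `mongod --setParameter startupRecoveryForRestore=true`), and the trade-off is explicit: the node gives up rollback and pre-recovery point-in-time reads in exchange for a recovery that does not accumulate history.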
| Comments |
| Comment by Githook User [ 12/May/21 ] |
|
Author: Matthew Russotto <matthew.russotto@mongodb.com> (mtrussotto)
Message: (cherry picked from commit 40b7635321caadb7219f7d990a049a93d9776490) |
| Comment by Githook User [ 11/May/21 ] |
|
Author: Matthew Russotto <matthew.russotto@mongodb.com> (mtrussotto)
Message: (cherry picked from commit 8a7e9a21fd0e10ddc1b41345e5bea1a82141061b) |
| Comment by Githook User [ 29/Apr/21 ] |
|
Author: Matthew Russotto <matthew.russotto@mongodb.com> (mtrussotto)
Message: |
| Comment by Judah Schvimer [ 06/Apr/21 ] |
|
If recovering safely from a crash mid-recovery in "for restore" mode proves difficult, we could require that the entire restore process succeed in one attempt, possibly by setting the initial sync flag at the beginning of recovery. I suspect we don't actually want to advance the stable timestamp during recovery, since that would be a "lie" about the majority commit point, and correcting that lie after transitioning out of recovery would be difficult. Rather, we would want to set the InitialDataTimestamp to Timestamp::kAllowUnstableCheckpointsSentinel. This ensures we take unstable checkpoints periodically during recovery, which lets the storage engine evict cache and store less history. We already have code for recovering from an unstable checkpoint, which I think is (only?) used when a node has the takeUnstableCheckpointOnShutdown flag set. A sketch of the initial-sync-flag idea follows below. |
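As a rough illustration of the initial-sync-flag idea, here is a hedged C++ sketch: the node persists a "not yet consistent" marker before a "for restore" recovery begins and clears it only on success, so a crash mid-recovery is detected at the next startup instead of serving half-applied data. The marker file and all names here are illustrative stand-ins, not MongoDB's actual initial sync flag machinery.

```cpp
// Illustrative sketch of guarding "for restore" recovery with a persisted
// flag, analogous to MongoDB's initial sync flag. All names are stand-ins.
#include <filesystem>
#include <fstream>
#include <stdexcept>

namespace fs = std::filesystem;

// Hypothetical marker file standing in for the persisted initial sync flag.
const fs::path kRestoreInProgressFlag{"restore_recovery_in_progress"};

bool restoreFlagIsSet() {
    return fs::exists(kRestoreInProgressFlag);
}

void setRestoreFlag() {
    std::ofstream{kRestoreInProgressFlag} << "1";
}

void clearRestoreFlag() {
    fs::remove(kRestoreInProgressFlag);
}

void runForRestoreRecovery() {
    // ... apply oplog entries with unstable checkpoints, as sketched above ...
}

void startupRecovery() {
    if (restoreFlagIsSet()) {
        // A previous "for restore" recovery crashed partway through; the data
        // may reflect a half-applied oplog, so refuse to start and require
        // the operator to redo the restore (or initial sync) from scratch.
        throw std::runtime_error(
            "previous 'for restore' recovery did not complete; restore again");
    }
    setRestoreFlag();    // recovery must succeed in one attempt from here on
    runForRestoreRecovery();
    clearRestoreFlag();  // only an uninterrupted recovery reaches this line
}
```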
| Comment by Eric Milkie [ 02/Apr/21 ] |
|
I'm not sure option 1 will be very effective: the untimestamped writes will still create history, their interaction with the storage engine is complicated, and there may still be bugs in how they are handled with durable history. I think we should pursue option 2 first when we implement this. |