[SERVER-55766] Introduce an optimized "for restore" startup replication recovery mechanism Created: 02/Apr/21  Updated: 29/Oct/23  Resolved: 03/May/21

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.2.15, 4.4.7, 5.0.0-rc0

Type: Improvement Priority: Major - P3
Reporter: Judah Schvimer Assignee: Matthew Russotto
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
causes SERVER-81878 startupRecoveryForRestore may not pla... Closed
causes SERVER-81879 startupRecoveryForRestore can drop ta... Closed
Related
related to SERVER-55483 Add a new startup parameter that skip... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.4, v4.2
Sprint: Repl 2021-05-03, Repl 2021-05-17
Participants:
Case:

 Description   

After a restore, users generally don't need to be able to roll back or do point-in-time (PIT) reads earlier than the top of the oplog.

Replication recovery can also take a very long time after a restore, and the stable/oldest timestamp cannot advance while it runs. That is costly even with durable history, and it can lead to very poor performance on 4.2, which predates durable history.

We should provide a startup parameter that, when set, applies oplog entries either:

  1. without timestamps, so that no history is created, or
  2. with timestamps, but advancing the stable/oldest timestamp between batches
     so that the storage engine can evict history (see the sketch after this list).
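
A minimal C++ sketch of option 2's batch loop, against an assumed storage-engine interface; applyBatchWithTimestamps(), OplogBatcher, and the setter names here are illustrative, not the actual recovery code path:

    // Apply batches at their original timestamps, but advance the stable and
    // oldest timestamps after every batch so the storage engine can
    // checkpoint and evict the history that recovery would otherwise pin.
    void applyOplogForRestore(StorageEngine* engine, OplogBatcher* batcher) {
        while (auto batch = batcher->nextBatch()) {
            Timestamp lastApplied = applyBatchWithTimestamps(*batch);
            engine->setStableTimestamp(lastApplied);
            engine->setOldestTimestamp(lastApplied);
        }
    }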

We may have to set the initial data timestamp at the end of recovery to prevent rollbacks to, or reads at, timestamps earlier than that point. We also need to consider what happens if the node crashes halfway through recovery, and make sure that case cannot corrupt data.
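
A sketch of that end-of-recovery step plus a crash guard, with illustrative names throughout (the marker helper is hypothetical):

    // Pin the initial data timestamp at the top of the oplog so the node
    // refuses rollbacks to, or reads at, any earlier timestamp. A durable
    // "restore recovery in progress" marker, written before the first batch,
    // would let a restart detect a mid-recovery crash rather than serving
    // half-recovered data.
    void finishRecoveryForRestore(StorageEngine* engine, Timestamp topOfOplog) {
        engine->setInitialDataTimestamp(topOfOplog);
        clearRestoreRecoveryMarker();  // hypothetical durable marker
    }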

This should only be supported and used in Atlas.

Note that if a rollback were necessary to a point before the end of recovery, it would fail unrecoverably. If the restore was used to seed a new replica set, no node in that set is expected to roll back to a point before the last seeded oplog entry.

Credit to lingzhi.deng for this idea.



 Comments   
Comment by Githook User [ 12/May/21 ]

Author: Matthew Russotto <matthew.russotto@mongodb.com> (mtrussotto)

Message: SERVER-55766 Introduce an optimized "for restore" startup replication recovery mechanism

(cherry picked from commit 40b7635321caadb7219f7d990a049a93d9776490)
Branch: v4.2
https://github.com/mongodb/mongo/commit/2a99e03b813f33342ffe83ccc5df9b8d2c33bf08

Comment by Githook User [ 11/May/21 ]

Author: Matthew Russotto <matthew.russotto@mongodb.com> (mtrussotto)

Message: SERVER-55766 Introduce an optimized "for restore" startup replication recovery mechanism

(cherry picked from commit 8a7e9a21fd0e10ddc1b41345e5bea1a82141061b)
Branch: v4.4
https://github.com/mongodb/mongo/commit/40b7635321caadb7219f7d990a049a93d9776490

Comment by Githook User [ 29/Apr/21 ]

Author: Matthew Russotto <matthew.russotto@mongodb.com> (mtrussotto)

Message: SERVER-55766 Introduce an optimized "for restore" startup replication recovery mechanism
Branch: master
https://github.com/mongodb/mongo/commit/8a7e9a21fd0e10ddc1b41345e5bea1a82141061b

Comment by Judah Schvimer [ 06/Apr/21 ]

If recovering safely after a crash mid-recovery in "for restore" mode were difficult, we could instead require that the entire restore process succeed in one go, possibly by setting the initial sync flag at the beginning of recovery.
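
A sketch of that fallback, with simplified signatures (the real mechanism lives in ReplicationConsistencyMarkers, but treat these exact calls as assumptions):

    // Set the initial sync flag before applying anything. If the node
    // crashes mid-recovery, the flag is still set on restart, so the node
    // demands a fresh sync instead of starting from half-recovered data.
    void recoverForRestore(ConsistencyMarkers* markers) {
        markers->setInitialSyncFlag();    // durable: "data not yet consistent"
        applyAllOplogBatchesForRestore(); // illustrative
        markers->clearInitialSyncFlag();  // reached only on success
    }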

I suspect we don't actually want to advance the stable timestamp during recovery, since that would be a "lie" about the majority commit point, and correcting that lie after transitioning out of recovery would be difficult. Rather, we would want to set the InitialDataTimestamp to Timestamp::kAllowUnstableCheckpointsSentinel. This would ensure we take unstable checkpoints periodically during recovery, which allow the storage engine to evict cache and store less history. We have code for recovering from an unstable checkpoint, which I think is (only?) used when a node has the takeUnstableCheckpointOnShutdown flag set.
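
A sketch of that alternative; Timestamp::kAllowUnstableCheckpointsSentinel is the real constant quoted above, but the surrounding function and setter names are illustrative:

    // Leave the stable timestamp alone (so the majority commit point is not
    // misstated) and instead allow unstable checkpoints during recovery,
    // letting the engine evict cache and keep less history.
    void beginRecoveryForRestore(StorageEngine* engine) {
        engine->setInitialDataTimestamp(Timestamp::kAllowUnstableCheckpointsSentinel);
    }

    void endRecoveryForRestore(StorageEngine* engine, Timestamp topOfOplog) {
        // Re-pin a real initial data timestamp once recovery is complete.
        engine->setInitialDataTimestamp(topOfOplog);
    }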

Comment by Eric Milkie [ 02/Apr/21 ]

I'm not sure option 1 will be very effective: the untimestamped writes will still create history; how that history behaves is just more complicated, and there may still be bugs in how such writes are handled with durable history. I think we should pursue option 2 first when we implement this.
