[SERVER-55032] Get rid of ShardingStateRecovery once 5.0 branches out
Created: 08/Mar/21  Updated: 05/Dec/22  Resolved: 21/Sep/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Task
Priority: Major - P3
Reporter: Pierlauro Sciarelli
Assignee: [DO NOT USE] Backlog - Sharding EMEA
Resolution: Won't Do
Votes: 0
Labels: 5.0-cleanup, sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-49973 Get rid of RecoveryDocument Closed
Related
related to SERVER-60126 Complete TODO listed in SERVER-55032 Closed
is related to SERVER-60109 Ensure vector clock is recovered on s... Closed
is related to SERVER-60110 Get rid of ShardingStateRecovery once... Closed
Assigned Teams:
Sharding EMEA
Sprint: Sharding EMEA 2021-10-18
Participants:

 Description   

The ShardingStateRecovery machinery can be fully replaced by the waitFor* and recovery methods offered by the VectorClockMutable.



 Comments   
Comment by Githook User [ 08/Oct/21 ]

Author: Kaloian Manassiev (kaloianm) <kaloian.manassiev@mongodb.com>

Message: SERVER-60126 Swap TODO SERVER-55032 with SERVER-60110
Branch: master
https://github.com/mongodb/mongo/commit/91da50fc2c4cd81486d8a4a9fd220c43ef7f6ddd

Comment by Pierlauro Sciarelli [ 21/Sep/21 ]

Closing for the reasons explained in the comment from Jordi Serra Torrens (11/Aug/21).

Opened SERVER-60109 and SERVER-60110 to follow up.

Comment by Jordi Serra Torrens [ 11/Aug/21 ]

ShardingStateRecovery cannot yet be thrown out on 5.1.

Some context first: ShardingStateRecovery does the following on stepup:
a. If no 'metadata operation' is in progress, it recovers the configOpTime persisted in the minOpTimeRecovery document.
b. If a 'metadata operation' was in progress, it essentially does a linearizable read against the configsvr and advances the local knowledge of the configTime to the value gossiped in the response to that read.
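The two stepup cases can be sketched as a toy Python model. This is not the server's C++ code: `RecoveryDocument`, `ToyConfigServer`, `recover_on_stepup` and the field names are hypothetical, and times are plain integers standing in for real configTime values.

```python
from dataclasses import dataclass


@dataclass
class RecoveryDocument:
    """Toy stand-in for the minOpTimeRecovery document (field names are illustrative)."""
    config_op_time: int       # last configOpTime that was made durable
    active_metadata_ops: int  # number of metadata operations in flight


class ToyConfigServer:
    """Hypothetical config server that only exposes its latest configTime."""

    def __init__(self, config_time: int) -> None:
        self.config_time = config_time

    def linearizable_read(self) -> int:
        # The response gossips the config server's current configTime.
        return self.config_time


def recover_on_stepup(doc: RecoveryDocument,
                      configsvr: ToyConfigServer,
                      local_config_time: int = 0) -> int:
    """Return the configTime the new primary knows after recovery."""
    if doc.active_metadata_ops == 0:
        # Case (a): nothing was in flight, so the persisted value is enough.
        recovered = doc.config_op_time
    else:
        # Case (b): an operation may have committed after the last persist,
        # so ask the config server instead of trusting the document.
        recovered = configsvr.linearizable_read()
    # The locally known configTime only ever advances.
    return max(local_config_time, recovered)
```

In this model, a document with no active operations yields the persisted time, while a document with an active operation yields whatever the config server reports.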

Currently, ShardingStateRecovery is critical for the correctness of these two operations: chunk migration and movePrimary.

  • Chunk migration: Before the donor writes the migration decision to the migrationCoordinator document, we must durably persist a configTime that is inclusive of the migration commit. This ensures that, in case of donor failover, the new primary will know of a configTime inclusive of the migration. This is relevant for failovers that happen either after the migration decision has been made durable or after the coordinatorDocument has already been deleted.
    For failovers that happen before the migrationDecision has been set, notice that the new primary essentially does a linearizable read against the configsvr in ensureChunkVersionIsGreaterThan, which will recover a configTime inclusive of any commit the previous primary may have done. In other words, migrations don't really need guarantee (b) of ShardingStateRecovery. We are only interested in (a).
  • MovePrimary: MovePrimary does not implement any custom recovery logic, so it relies on both guarantees from ShardingStateRecovery.
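The ordering requirement for chunk migration can be illustrated with a small Python model. It is a sketch under stated assumptions: `DonorDisk`, `commit_migration`, `decision_is_covered` and the `crash_after_step` parameter are hypothetical names, and the real donor performs many more steps than the two shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DonorDisk:
    """Toy durable state of a donor shard (names are illustrative)."""
    durable_config_time: int = 0
    migration_decision: Optional[str] = None


def commit_migration(disk: DonorDisk,
                     commit_config_time: int,
                     crash_after_step: Optional[int] = None) -> None:
    """Run the donor's two steps in order, optionally 'crashing' after step 1."""
    # Step 1: durably persist a configTime that covers the migration commit.
    disk.durable_config_time = max(disk.durable_config_time, commit_config_time)
    if crash_after_step == 1:
        return
    # Step 2: only then record the migration decision.
    disk.migration_decision = "committed"


def decision_is_covered(disk: DonorDisk, commit_config_time: int) -> bool:
    """True if any durable decision is backed by a durable configTime that includes it."""
    return (disk.migration_decision is None
            or disk.durable_config_time >= commit_config_time)
```

Because step 1 always precedes step 2, `decision_is_covered` holds wherever the toy donor stops, which is the property a new primary relies on when it recovers only the persisted time.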

Guarantee (a) could be replaced by VectorClock::waitForDurableConfigTime. This way, ShardingStateRecovery would no longer need to make configOpTime durable, and we would instead rely on the VectorClock recovery. ShardingStateRecovery would still be needed for guarantee (b), which movePrimary requires.

However, it is not possible to throw away (a) in 5.1. The reason is that in multi-version clusters (5.0 <--> 5.1), we never recover the VectorClock on stepup. Because of that, the following situations could happen:

  • A v5.0 migration donor (which durably persists configOpTime only through ShardingStateRecovery) steps down. A v5.1 secondary steps up and attempts to recover the configTime only through the VectorClock, so it won't see ShardingStateRecovery's configOpTime. (This case could be addressed by keeping ShardingStateRecovery's recovery of configOpTime in 5.1.)
  • A v5.1 migration donor (assuming we make it durably persist configTime only through the VectorClock) steps down. A v5.0 secondary steps up, but since it only recovers ShardingStateRecovery's configOpTime, it won't see the time persisted in the VectorClock.

Similar reasoning should apply to movePrimary.
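Both failure scenarios reduce to the old primary writing one durable location while the new primary reads the other. A toy Python sketch follows, where the store names `"recovery_doc"` and `"vector_clock"` and the `failover` helper are hypothetical labels for the two persistence paths:

```python
STORES = ("recovery_doc", "vector_clock")


def failover(writer_store: str, reader_store: str, config_time: int) -> int:
    """Return the configTime a new primary recovers after the old one steps down."""
    disk = {store: 0 for store in STORES}
    # The old primary persists the time in the only store it knows about.
    disk[writer_store] = config_time
    # The new primary recovers from the only store it knows about.
    return disk[reader_store]
```

In this model `failover("recovery_doc", "vector_clock", 42)` and `failover("vector_clock", "recovery_doc", 42)` both return 0, meaning the persisted time is lost, while matching stores return 42.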


Possible course of action:

  • On v5.1
    • Run VectorClock::recover() (in addition to the ShardingStateRecovery::recover()) on stepup
    • In ShardingStateRecovery::endMetadataOp(), also make configTime durable on VectorClock
  • On v6.1
    • Change migration to no longer use ShardingStateRecovery and use only VectorClock::waitDurable
    • ShardingStateRecovery no longer reads or updates its own configOpTime (we can just rely on VectorClock now)
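
The staged plan can be checked against a toy Python model in which each version is described by the durable stores it writes and reads. The store names, the `V5_0`, `V5_1` and `V6_1` tuples, and the `persist` and `recover` helpers are illustrative assumptions, not server APIs.

```python
from typing import Dict, Tuple

# Stores each version persists to and recovers from (toy model).
V5_0: Tuple[str, ...] = ("recovery_doc",)
V5_1: Tuple[str, ...] = ("recovery_doc", "vector_clock")  # transitional: both
V6_1: Tuple[str, ...] = ("vector_clock",)


def persist(disk: Dict[str, int], stores: Tuple[str, ...], config_time: int) -> None:
    """Make config_time durable in every store this version writes."""
    for store in stores:
        disk[store] = max(disk.get(store, 0), config_time)


def recover(disk: Dict[str, int], stores: Tuple[str, ...]) -> int:
    """Recover the highest configTime found in the stores this version reads."""
    return max(disk.get(store, 0) for store in stores)
```

With the transitional behaviour, a time persisted by a 5.0 node is recovered by a 5.1 node and vice versa, and a time persisted by a 5.1 node is still visible to a node that reads only the VectorClock.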