[SERVER-73778] Require all internal server data cleanup as part of FCV downgrade be completed before allowing transition to kUpgraded for sharded clusters Created: 08/Feb/23  Updated: 29/Oct/23  Resolved: 20/Mar/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.0.0-rc0

Type: Task
Priority: Major - P3
Reporter: Samyukta Lanka
Assignee: Jordi Serra Torrens
Resolution: Fixed
Votes: 0
Labels: milestone-1, pm-2974-required
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-74282 Require all internal server data clea... Closed
is depended on by SERVER-64917 Enable featureFlagDowngradingToUpgrading Closed
Assigned Teams:
Replication
Backwards Compatibility: Fully Compatible
Sprint: Sharding EMEA 2023-03-20
Participants:

 Description   

We require that cleanup of internal collections only fail for retryable reasons. However, someone downgrading a cluster might not actually retry the FCV downgrade after such a failure. If the user instead tries to transition back to the upgraded FCV, we rely on our upgrade code to correctly rebuild internal server data, which is hard to get right and hard to test.

This poses additional problems in sharded clusters, where config servers and shard servers clean up their internal collections at different times. Allowing a transition to kUpgraded with partially cleaned up server metadata might mean that the cluster cannot rebuild what it needs on config servers or shard servers to function properly in the upgraded FCV.

Since this process is error-prone and provides few safety guarantees, we should require that internal server data cleanup either has not started yet or has fully completed before the FCV can transition to kUpgraded.

We will update the sharded cluster FCV downgrade process to have three phases, so that the new FCV state machine is Upgraded -> Downgrading -> CleaningServerMetadata -> Downgraded.
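
As a rough illustration of the state machine above, the following Python sketch models the four phases and the rule that a transition to kUpgraded is only allowed when cleanup has not started or has fully completed. The names `Phase`, `DOWNGRADE_PATH`, `next_downgrade_phase`, and `can_transition_to_upgraded` are hypothetical and are not taken from the server code; the sketch also ignores any intermediate upgrading state.

```python
from enum import Enum


class Phase(Enum):
    UPGRADED = "Upgraded"
    DOWNGRADING = "Downgrading"
    CLEANING_SERVER_METADATA = "CleaningServerMetadata"
    DOWNGRADED = "Downgraded"


# Forward path of the downgrade, in the order given in this ticket.
DOWNGRADE_PATH = [
    Phase.UPGRADED,
    Phase.DOWNGRADING,
    Phase.CLEANING_SERVER_METADATA,
    Phase.DOWNGRADED,
]


def next_downgrade_phase(phase):
    """Return the phase after `phase` on the downgrade path, or None at the end."""
    idx = DOWNGRADE_PATH.index(phase)
    return DOWNGRADE_PATH[idx + 1] if idx + 1 < len(DOWNGRADE_PATH) else None


def can_transition_to_upgraded(phase):
    """Allow the upgrade unless cleanup has started but not finished.

    Assumption: from UPGRADED the request is simply treated as a no-op.
    """
    return phase is not Phase.CLEANING_SERVER_METADATA
```

Under these assumptions, only the CleaningServerMetadata phase blocks a move back to Upgraded; every other phase corresponds to cleanup either not having begun or having finished.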

The CleaningServerMetadata phase is represented on disk by the isCleaningServerMetadata field. Upon entering the phase, the config server will persist isCleaningServerMetadata: true in its FCV document before starting to clean the server metadata. Once the server metadata has been fully cleaned up throughout the whole cluster (config and shard servers), we will remove the field.

This way, if a sharded cluster receives a setFCV upgrade command while it is in the Downgrading FCV, the config server will check for the existence of the isCleaningServerMetadata field and will fail the upgrade if the field exists.

We should test that if the config or the shard servers fail at any point during the internal server data cleanup, we fail to transition to kUpgraded.
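
The marker field, the upgrade check, and the failure scenario described above could be sketched in Python as follows. This is only a simplified model, not the server implementation: the FCV document is a plain dict, the `state` key is a stand-in rather than the real FCV document schema, and `UpgradeRejected`, `run_cleanup`, and `set_fcv_upgraded` are hypothetical names. Only the `isCleaningServerMetadata` field comes from this ticket.

```python
class UpgradeRejected(Exception):
    """Raised when an upgrade is requested while metadata cleanup is unfinished."""


def start_cleanup(fcv_doc):
    # Persist the marker before any server metadata is touched.
    fcv_doc["isCleaningServerMetadata"] = True


def finish_cleanup(fcv_doc):
    # Remove the marker only after every cleanup step has succeeded.
    fcv_doc.pop("isCleaningServerMetadata", None)
    fcv_doc["state"] = "Downgraded"


def run_cleanup(fcv_doc, cleanup_steps):
    """Run every cleanup step; the marker stays set if any step raises."""
    start_cleanup(fcv_doc)
    for step in cleanup_steps:
        step()  # e.g. one step per config or shard server (hypothetical)
    finish_cleanup(fcv_doc)


def set_fcv_upgraded(fcv_doc):
    # The check is on the existence of the field, as described in the ticket.
    if "isCleaningServerMetadata" in fcv_doc:
        raise UpgradeRejected("server metadata cleanup has not completed")
    fcv_doc["state"] = "Upgraded"
```

In this model, a cleanup step that raises partway through leaves the marker in the document, so a later call to `set_fcv_upgraded` is rejected until the downgrade is retried and `run_cleanup` finishes.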



 Comments   
Comment by Githook User [ 20/Mar/23 ]

Author: Jordi Serra Torrens <jordi.serra-torrens@mongodb.com> (jordist)

Message: SERVER-73778 Require all internal server data cleanup as part of FCV downgrade be completed before allowing transition to kUpgraded for sharded clusters
Branch: master
https://github.com/mongodb/mongo/commit/29dec97a718571dbef8ac3ffe4c68115e5874909
