Type: Task
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 7.0.0-rc0
Affects Version/s: None
Component/s: None
Labels:
- milestone-1
- pm-2974-required

Assigned Teams:

Replication
Backwards Compatibility:
Fully Compatible
Sprint:
Sharding EMEA 2023-03-20
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

We require that cleanup of internal collections fail only for retryable reasons. However, someone downgrading a cluster might not actually retry the FCV downgrade in such a situation. If the user instead tries to transition back to the upgraded FCV, we rely on our upgrade code to correctly rebuild internal server data, which is hard to get right and hard to test.

This poses additional problems in sharded clusters where config servers and shard servers clean up their internal collections at different times. Allowing a transition to kUpgraded with partially cleaned up server metadata might mean that the cluster cannot rebuild what it needs on config servers or shard servers to properly function in the upgraded FCV.

Since this process is error-prone and provides few safety guarantees, we should allow a transition of the FCV to kUpgraded only if internal server data cleanup either has not started yet or has fully completed.

We will update the sharded cluster FCV downgrade process to have three phases, so that the new FCV state machine is Upgraded -> Downgrading -> CleaningServerMetadata -> Downgraded.
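As a rough illustration of the state machine above, the following Python sketch models the four named states and the transitions the ticket implies. The names `FcvState`, `ALLOWED`, and `can_transition` are hypothetical and do not come from the server code. The model is deliberately simplified: it omits the normal upgrade path out of Downgraded and any other intermediate states the server may use.

```python
from enum import Enum


class FcvState(Enum):
    UPGRADED = "Upgraded"
    DOWNGRADING = "Downgrading"
    CLEANING_SERVER_METADATA = "CleaningServerMetadata"
    DOWNGRADED = "Downgraded"


# Allowed transitions in this simplified model (an assumption, not server code).
# The forward path is taken from the ticket. The Downgrading -> Upgraded edge
# reflects that an upgrade is permitted only while cleanup has not started.
# The regular upgrade path out of Downgraded is omitted from this sketch.
ALLOWED = {
    FcvState.UPGRADED: {FcvState.DOWNGRADING},
    FcvState.DOWNGRADING: {FcvState.CLEANING_SERVER_METADATA, FcvState.UPGRADED},
    FcvState.CLEANING_SERVER_METADATA: {FcvState.DOWNGRADED},
    FcvState.DOWNGRADED: set(),
}


def can_transition(current: FcvState, target: FcvState) -> bool:
    """Return True if the model permits moving from current to target."""
    return target in ALLOWED[current]
```

In this model, `can_transition(FcvState.CLEANING_SERVER_METADATA, FcvState.UPGRADED)` is false, which is the property the ticket asks for: once cleanup has begun, the only way out is to finish it.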

The CleaningServerMetadata phase is represented on disk by the isCleaningServerMetadata field. Upon entering the phase, the config server will persist the field isCleaningServerMetadata: true to its FCV document before starting to clean the server metadata. Once the server metadata has been fully cleaned up throughout the whole cluster (config and shard servers), we will remove the field.

This way, if a sharded cluster receives a setFCV upgrade command while it is in the Downgrading FCV, the config server will check for the isCleaningServerMetadata field and will fail the upgrade if the field exists.
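The marker lifecycle and the upgrade check can be sketched in Python as follows. This is a minimal model under stated assumptions, not the server implementation: the FCV document is represented as a plain dict with its other fields omitted, and the function names are hypothetical. Only the field name `isCleaningServerMetadata` comes from the ticket.

```python
FIELD = "isCleaningServerMetadata"  # field name taken from the ticket


def begin_metadata_cleanup(fcv_doc: dict) -> None:
    """Persist the marker before any server metadata is touched."""
    fcv_doc[FIELD] = True


def finish_metadata_cleanup(fcv_doc: dict) -> None:
    """Remove the marker once the whole cluster has been cleaned."""
    fcv_doc.pop(FIELD, None)


def check_upgrade_allowed(fcv_doc: dict) -> None:
    """Reject an upgrade request while the cleanup marker is present.

    The check is on the existence of the field, as the ticket describes,
    not on its value.
    """
    if FIELD in fcv_doc:
        raise RuntimeError(
            "cannot upgrade FCV: server metadata cleanup is in progress"
        )


# Usage: an upgrade is accepted before cleanup starts and after it finishes.
fcv_doc = {}  # other FCV document fields omitted in this sketch
check_upgrade_allowed(fcv_doc)
begin_metadata_cleanup(fcv_doc)
finish_metadata_cleanup(fcv_doc)
check_upgrade_allowed(fcv_doc)
```

Writing the marker before cleanup starts, and removing it only after the last node is done, means a crash or error at any point in between leaves the marker on disk.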

We should test that if the config or the shard servers fail at any point during the internal server data cleanup, we fail to transition to kUpgraded.
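The shape of such a test can be sketched with a toy model in Python. This is not the server's test harness: the node names, `CleanupFailed`, `run_cleanup`, and the other helpers are hypothetical, and the FCV document is again a plain dict. The sketch injects a failure at each node in turn and checks that the upgrade stays blocked every time.

```python
FIELD = "isCleaningServerMetadata"  # field name taken from the ticket


class CleanupFailed(Exception):
    """Stands in for an error raised by a node during cleanup."""


def run_cleanup(fcv_doc: dict, nodes: list, fail_on=None) -> None:
    """Set the marker, clean each node in order, then clear the marker.

    If cleanup fails on `fail_on`, the marker is deliberately left in place.
    """
    fcv_doc[FIELD] = True
    for node in nodes:
        if node == fail_on:
            raise CleanupFailed(node)
    del fcv_doc[FIELD]


def upgrade_allowed(fcv_doc: dict) -> bool:
    """An upgrade is allowed only when the marker is absent."""
    return FIELD not in fcv_doc


def check_all_failure_points(nodes: list) -> bool:
    """Return True if every injected failure leaves the upgrade blocked."""
    for failing in nodes:
        doc = {}
        try:
            run_cleanup(doc, nodes, fail_on=failing)
        except CleanupFailed:
            pass
        if upgrade_allowed(doc):
            return False
    return True
```

For example, `check_all_failure_points(["config", "shard0", "shard1"])` exercises a failure on the config server and on each shard, while a call to `run_cleanup` with no injected failure should leave `upgrade_allowed` true.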

depends on

SERVER-74282 Require all internal server data cleanup be completed before allowing transition to upgraded for replica sets

Closed

is depended on by

SERVER-64917 Enable featureFlagDowngradingToUpgrading

Closed

Assignee: Jordi Serra Torrens
Reporter: Samyukta Lanka
Participants: Githook User, Jordi Serra Torrens, Samyukta Lanka
Votes: 0
Watchers: 7

Created: Feb 08 2023 07:47:29 PM UTC
Updated: Oct 29 2023 09:26:40 PM UTC
Resolved: Mar 20 2023 11:12:53 AM UTC
Confidence Status Last Update: 15/Mar/23 4:04 PM
