  1. Core Server
  2. SERVER-73778

Require all internal server data cleanup as part of FCV downgrade be completed before allowing transition to kUpgraded for sharded clusters

    • Replication
    • Fully Compatible
    • Sharding EMEA 2023-03-20

      We require that cleanup of internal collections only fail for retryable reasons. However, someone downgrading a cluster might not actually retry the FCV downgrade after such a failure. If the user instead tries to transition back to the upgraded FCV, we would be relying on our upgrade code to properly rebuild internal server data, which is hard to get right and also hard to test.

      This poses additional problems in sharded clusters where config servers and shard servers clean up their internal collections at different times. Allowing a transition to kUpgraded with partially cleaned up server metadata might mean that the cluster cannot rebuild what it needs on config servers or shard servers to properly function in the upgraded FCV.

      Since this process is error-prone and doesn't provide much in terms of safety guarantees, we should only allow the FCV to transition to kUpgraded if internal server data cleanup either hasn't started yet or has fully completed.

      We will update the sharded cluster FCV downgrade process to have three phases, such that the new FCV state machine is Upgraded -> Downgrading -> CleaningServerMetadata -> Downgraded.
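
      As a rough illustration of the state machine above, the allowed transitions could be modelled as a lookup table. This is only a sketch: the names FcvState, ALLOWED and can_transition are hypothetical, it covers only the four states named in this ticket, and it assumes that aborting a downgrade (Downgrading -> Upgraded) remains legal as long as cleanup has not started.

```python
from enum import Enum


class FcvState(Enum):
    UPGRADED = "Upgraded"
    DOWNGRADING = "Downgrading"
    CLEANING_SERVER_METADATA = "CleaningServerMetadata"
    DOWNGRADED = "Downgraded"


# Allowed transitions among the four states named in this ticket.
# Assumption: Downgrading -> Upgraded is still permitted because cleanup
# has not started yet; CleaningServerMetadata -> Upgraded is not.
ALLOWED = {
    FcvState.UPGRADED: {FcvState.DOWNGRADING},
    FcvState.DOWNGRADING: {FcvState.CLEANING_SERVER_METADATA, FcvState.UPGRADED},
    FcvState.CLEANING_SERVER_METADATA: {FcvState.DOWNGRADED},
    FcvState.DOWNGRADED: set(),  # the upgrade path from Downgraded is out of scope here
}


def can_transition(src: FcvState, dst: FcvState) -> bool:
    """Return True if the sketch's transition table permits src -> dst."""
    return dst in ALLOWED[src]
```

      The point of the table is that once cleanup begins, the only way forward is to finish it.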

      The CleaningServerMetadata phase is represented on disk with the isCleaningServerMetadata field. Upon entering the phase, the config server will persist a field isCleaningServerMetadata: true to its FCV document before starting to clean the server metadata. Once we are done fully cleaning up the server metadata throughout the whole cluster (config and shard servers), we will remove the field.
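
      The marker lifecycle described above can be sketched with a plain dictionary standing in for the FCV document. The function run_cleanup_phase and the cleanup_steps callables are hypothetical, and the real document carries other fields that are omitted here. The property worth noticing is that a failed step leaves the marker in place.

```python
from typing import Callable, Dict, List

FIELD = "isCleaningServerMetadata"


def run_cleanup_phase(fcv_doc: Dict, cleanup_steps: List[Callable[[], None]]) -> None:
    """Toy model of the cleanup phase on the config server (not server code)."""
    # Persist the marker before any cleanup work begins.
    fcv_doc[FIELD] = True
    # Run cleanup for the config server and each shard. An exception
    # propagates and deliberately leaves the marker set (no try/finally).
    for step in cleanup_steps:
        step()
    # Remove the marker only once every step has succeeded.
    del fcv_doc[FIELD]
```

      Retrying after a failure simply sets the marker again and re-runs the steps, which matches the requirement that cleanup fail only for retryable reasons.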

      This way, if a sharded cluster receives a setFCV upgrade command while it is in the Downgrading FCV, the config server will check for the existence of the isCleaningServerMetadata field and will reject the upgrade if the field exists.
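
      A minimal sketch of that check, again using a dictionary in place of the FCV document. CannotUpgradeError and check_upgrade_allowed are hypothetical names, and the error message is illustrative only.

```python
class CannotUpgradeError(Exception):
    """Hypothetical error for a rejected setFCV upgrade."""


def check_upgrade_allowed(fcv_doc: dict) -> None:
    """Raise if the document shows that metadata cleanup started but did not finish."""
    # The mere presence of the field means cleanup has begun and is not
    # yet complete across the cluster, so the upgrade must be refused.
    if "isCleaningServerMetadata" in fcv_doc:
        raise CannotUpgradeError(
            "server metadata cleanup has started; finish the downgrade before upgrading"
        )
```

      Checking for the field's existence rather than its value keeps the gate conservative: any document that still carries the field blocks the upgrade.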

      We should test that if the config or the shard servers fail at any point during the internal server data cleanup, we fail to transition to kUpgraded.
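
      The shape of such a test can be sketched as a fault-injection loop over every node. FakeCluster is a toy in-memory stand-in, not a real fixture, and the node names are made up; an actual test would presumably use the server's own test harness and failure-injection mechanisms.

```python
FIELD = "isCleaningServerMetadata"


class FakeCluster:
    """Toy in-memory stand-in for a sharded cluster (hypothetical)."""

    def __init__(self, nodes, fail_at=None):
        self.nodes = nodes        # e.g. ["config", "shard0", "shard1"]
        self.fail_at = fail_at    # node whose cleanup raises, or None
        self.fcv_doc = {}

    def downgrade_cleanup(self):
        # Set the marker, clean each node in turn, then clear the marker.
        self.fcv_doc[FIELD] = True
        for node in self.nodes:
            if node == self.fail_at:
                raise RuntimeError(f"cleanup failed on {node}")
        del self.fcv_doc[FIELD]

    def try_upgrade(self) -> bool:
        # The upgrade is accepted only when the marker is absent.
        return FIELD not in self.fcv_doc


def upgrade_rejected_after_any_failure(nodes) -> bool:
    """Inject a failure at each node in turn and check the upgrade is refused."""
    for failing in nodes:
        cluster = FakeCluster(nodes, fail_at=failing)
        try:
            cluster.downgrade_cleanup()
        except RuntimeError:
            pass
        if cluster.try_upgrade():
            return False
    return True
```

      Iterating over every node covers both the config server and shard server failure cases named above.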

            Assignee:
            jordi.serra-torrens@mongodb.com Jordi Serra Torrens
            Reporter:
            samy.lanka@mongodb.com Samyukta Lanka