Reject network commands with an inconsistent Operation FCV


    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: 8.3.0-rc0, 8.2.0
    • Component/s: None
    • Catalog and Routing
    • 🟩 Routing and Topology

      SERVER-99557 / SERVER-99558 added a generic versionContext command argument that lets a caller request that a command run under a given FCV (e.g. for feature flag checks). ShardingDDLCoordinators use it to keep a single FCV snapshot (the Operation FCV) throughout their lifetime, even as setFCV concurrently transitions each shard to a new global FCV. After transitioning, setFCV waits for any ShardingDDLCoordinators using a stale FCV snapshot to complete. setFCV can then upgrade the server metadata knowing that all operations are using the new FCV.

       

      Under unreliable networks, node stepdowns, etc., a ShardingDDLCoordinator may need to retry network commands. The ShardingDDLCoordinator may complete thanks to the retry, allowing setFCV to perform the metadata upgrade, and only afterwards does the first network attempt arrive, carrying a stale Operation FCV.

       

      Currently, we admit those commands, on the assumption that they have no ill effect because retrying a previously completed action of a ShardingDDLCoordinator should be a no-op. However, this assumption is fragile, and it would be safer to reject those commands. This requires setFCV to persist a flag/phase once all commands are expected to use the new FCV.
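
To make the race and the proposed rejection concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions, not server code: `ToyShard`, `required_operation_fcv`, `StaleOperationFCV`, and the FCV values are all hypothetical names and values chosen for illustration, and the real flag/phase, its storage, and the exact comparison rule are still to be designed.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

FCV = Tuple[int, int]  # illustrative (major, minor) pair, not the server's FCV type


class StaleOperationFCV(Exception):
    """Hypothetical error: the command's Operation FCV is no longer admitted."""


@dataclass
class ToyShard:
    global_fcv: FCV
    # Hypothetical persisted flag/phase. None means "no restriction yet",
    # which corresponds to today's behavior of admitting every command.
    required_operation_fcv: Optional[FCV] = None
    applied: List[Tuple[str, FCV]] = field(default_factory=list)

    def set_fcv(self, new_fcv: FCV) -> None:
        # 1. Transition the global FCV.
        self.global_fcv = new_fcv
        # 2. (Elided) wait for coordinators holding a stale snapshot to complete.
        # 3. Persist the flag: from now on every command must use the new FCV.
        self.required_operation_fcv = new_fcv
        # 4. (Elided) upgrade the server metadata.

    def run_command(self, name: str, operation_fcv: FCV) -> None:
        required = self.required_operation_fcv
        if required is not None and operation_fcv != required:
            raise StaleOperationFCV(
                f"{name}: Operation FCV {operation_fcv} != required {required}"
            )
        self.applied.append((name, operation_fcv))


if __name__ == "__main__":
    shard = ToyShard(global_fcv=(8, 0))
    # The coordinator's first attempt is delayed in the network; its retry lands.
    shard.run_command("retry", (8, 0))
    # The coordinator completes, so setFCV proceeds and persists the flag.
    shard.set_fcv((8, 3))
    # The delayed first attempt finally arrives with the stale Operation FCV.
    try:
        shard.run_command("delayed first attempt", (8, 0))
    except StaleOperationFCV as exc:
        print("rejected:", exc)
```

In this model the delayed command is rejected instead of relying on it being a no-op, while commands carrying the new FCV are still admitted.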

            Assignee:
            Unassigned
            Reporter:
            Joan Bruguera Micó
            Votes:
            0
            Watchers:
            2
