Fix edge cases for draining of chunk operations on setFCV for Authoritative Shards


    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing
    • CAR Team 2026-06-22, CAR Team 2026-07-06

      We want to guarantee that:

      • During upgrade (8.X FCV -> 9.0 FCV) no legacy chunk operations should be run once the FCV is fully upgraded to 9.0.
      • During downgrade (9.0 FCV -> 8.X FCV) no authoritative chunk operation should be run once the FCV is fully downgraded.

       

      For Authoritative Shards, since SERVER-127654 we stop and drain chunk migrations during setFCV to create a barrier between non-authoritative and authoritative migrations. We do this by blocking the registration of new migrations in the ActiveMigrationsRegistry and draining the migrations that are already registered. However, there are two edge cases we missed:
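
      As a rough illustration of the block-and-drain barrier described above, here is a minimal standalone sketch. ToyMigrationsRegistry and all of its methods are hypothetical names invented for this example; they are not the server's ActiveMigrationsRegistry API, and the real implementation may differ substantially.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Toy stand-in for the ActiveMigrationsRegistry (hypothetical, not the server class).
class ToyMigrationsRegistry {
public:
    // Registers a migration unless a toy setFCV has blocked registration.
    bool tryRegister() {
        std::lock_guard<std::mutex> lk(_mutex);
        if (_blocked) {
            return false;
        }
        ++_active;
        return true;
    }

    // Marks one registered migration as finished and wakes any drainer.
    void unregister() {
        std::lock_guard<std::mutex> lk(_mutex);
        assert(_active > 0);
        --_active;
        if (_active == 0) {
            _drainedCv.notify_all();
        }
    }

    // Barrier step: refuse new registrations, then wait for in-flight ones.
    void blockAndDrain() {
        std::unique_lock<std::mutex> lk(_mutex);
        _blocked = true;
        _drainedCv.wait(lk, [this] { return _active == 0; });
    }

    // Re-opens registration once the toy FCV transition is complete.
    void unblock() {
        std::lock_guard<std::mutex> lk(_mutex);
        _blocked = false;
    }

    int activeCount() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _active;
    }

private:
    std::mutex _mutex;
    std::condition_variable _drainedCv;
    bool _blocked = false;
    int _active = 0;
};
```

      In this toy model, blockAndDrain() only waits for work that has already called tryRegister(). Anything that has made a decision but not yet registered is invisible to the drain, which is the gap behind both edge cases.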

       

      Edge case 1: Authoritative chunk op coordinator that has not yet registered in the ActiveMigrationsRegistry

      Scenario: A chunk command has decided it will act authoritatively and has created an authoritative chunk op coordinator. However, the chunk op coordinator has not yet registered the migration in the ActiveMigrationsRegistry.

      Consequence: The drain misses it, because it only considers the ActiveMigrationsRegistry. We then hit this tassert and setFCV fails.

      Fix: Remove the tassert. The chunk op coordinator will drain along with the rest of the sharding coordinators, and it will fail to register the migration because of the ActiveMigrationsRegistry guardrails.

      Edge case 2: Chunk command has taken a decision on authoritativeness but not yet registered anything

      Scenario: A chunk command has decided whether it will act authoritatively or non-authoritatively, but it has not yet created an authoritative chunk op coordinator or registered a non-authoritative chunk op.

      Consequence: The drain misses it and setFCV continues. The migration will likely hit the configsvr transitional FCV guardrail when it attempts to commit. In the worst case, however, it can survive the full FCV transition and execute afterwards, which can create an inconsistency between the shard authoritative catalog and the global catalog.

      Fix: Take a FixedFCVRegion for the decision, held until either the authoritative chunk op coordinator is created or the legacy migration is registered in the ActiveMigrationsRegistry. In both cases, the migration will eventually fail because it cannot register in the ActiveMigrationsRegistry.
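
      The idea behind this fix can be sketched with a toy model in which the fixed-FCV region is a shared lock and an FCV change needs the exclusive lock. This is only an assumption made for illustration: ToyFcvState, decideAndRegister, beginToyTransition and finishToyTransition are hypothetical names, not server code, and the sketch does not claim to reproduce how FixedFCVRegion is implemented.

```cpp
#include <mutex>
#include <shared_mutex>

enum class ToyFcv { kLegacy, kAuthoritative };

// Hypothetical shared state for the toy model.
struct ToyFcvState {
    std::shared_mutex fcvMutex;        // shared: fixed-FCV region; exclusive: FCV change
    ToyFcv fcv = ToyFcv::kLegacy;
    bool registrationBlocked = false;  // written only under the exclusive lock
    std::mutex registryMutex;          // protects 'registered'
    int registered = 0;                // number of chunk ops visible to the drain
};

struct ToyChunkOpResult {
    ToyFcv decidedUnder;  // FCV observed when the decision was taken
    bool registered;      // whether the op became visible to the drain
};

// Takes the decision and registers while holding the region, so no FCV
// change can slip in between the two steps.
ToyChunkOpResult decideAndRegister(ToyFcvState& state) {
    std::shared_lock<std::shared_mutex> fixedRegion(state.fcvMutex);
    ToyChunkOpResult result{state.fcv, false};
    if (!state.registrationBlocked) {
        std::lock_guard<std::mutex> lk(state.registryMutex);
        ++state.registered;
        result.registered = true;
    }
    return result;
}

// Toy setFCV, first half: waits for open regions, then blocks registration.
void beginToyTransition(ToyFcvState& state) {
    std::unique_lock<std::shared_mutex> lk(state.fcvMutex);
    state.registrationBlocked = true;
}

// Toy setFCV, second half: installs the target FCV and re-opens registration.
void finishToyTransition(ToyFcvState& state, ToyFcv target) {
    std::unique_lock<std::shared_mutex> lk(state.fcvMutex);
    state.fcv = target;
    state.registrationBlocked = false;
}
```

      Under these assumptions, a chunk op is either registered before the transition begins, and is therefore seen by the drain, or it observes the blocked registry and fails. It cannot hold a stale decision across the whole transition.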

            Assignee:
            Joan Bruguera Micó
            Reporter:
            Pol Pinol