  1. Core Server
  2. SERVER-87931

MovePrimary + FCV upgrade / downgrade race may dodge FCV cleanup & checks

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing
    • ALL

      As described for a particular case in SERVER-87927, a race is possible in which a movePrimary interleaves with the setFeatureCompatibilityVersion upgrade / downgrade checks and cleanup actions and bypasses them. To break this down:
      1. Let's say we have a collection living on Shard A that would trigger the checks in _userCollectionsUassertsForDowngrade(...).
      2. An FCV downgrade command is received, and the config server tells both shards to start downgrading.
      3. Shard A and Shard B both finish reaching the "transitioning" state.
      4. A movePrimary operation starts to copy the problematic collection from Shard A to Shard B.
      5. The movePrimary runs before Shard A performs the user collection checks, but after Shard B has already completed its own. The collection is therefore copied from Shard A to Shard B without ever being checked on Shard B, because Shard B finished its checks before the collection arrived.
      6. After the movePrimary, Shard A runs the user collection checks. There is nothing left to check (the collection has already been moved off Shard A), so the checks pass.
      7. Therefore the FCV downgrade completes without the checks ever examining the problematic collection.
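
      The interleaving in steps 1-7 can be replayed deterministically with a toy model. This is only an illustrative sketch, not server code: the `Shard` class, `move_primary`, `racy_downgrade`, and the collection name `db.problematic` are all hypothetical, and `run_user_collection_checks` merely stands in for _userCollectionsUassertsForDowngrade(...).

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """Toy stand-in for a shard: a name, an FCV state, and the collections it owns."""
    name: str
    fcv_state: str = "upgraded"
    collections: set = field(default_factory=set)
    checks_done: bool = False

    def start_downgrade(self):
        self.fcv_state = "transitioning"

    def run_user_collection_checks(self, incompatible):
        # Hypothetical stand-in for _userCollectionsUassertsForDowngrade(...):
        # fail if this shard currently owns a collection that blocks the downgrade.
        offending = self.collections & incompatible
        if offending:
            raise RuntimeError(f"{self.name}: cannot downgrade, found {sorted(offending)}")
        self.checks_done = True


def move_primary(src, dst, coll):
    """Toy movePrimary: the collection leaves src and appears on dst."""
    src.collections.remove(coll)
    dst.collections.add(coll)


def racy_downgrade():
    incompatible = {"db.problematic"}
    a = Shard("A", collections={"db.problematic"})  # step 1
    b = Shard("B")

    a.start_downgrade()  # steps 2-3: both shards reach "transitioning"
    b.start_downgrade()

    b.run_user_collection_checks(incompatible)  # B checks while it owns nothing
    move_primary(a, b, "db.problematic")        # steps 4-5: collection moves A -> B
    a.run_user_collection_checks(incompatible)  # step 6: A has nothing left to check

    a.fcv_state = b.fcv_state = "downgraded"    # step 7: downgrade completes
    return a, b


a, b = racy_downgrade()
print(b.fcv_state, sorted(b.collections))  # prints: downgraded ['db.problematic']
```

      In this model both shards report their checks as done, yet Shard B ends up downgraded while still owning the collection that should have blocked the downgrade.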

      We may have to rethink when and where we call _userCollectionsUassertsForDowngrade and _internalServerCleanupForDowngrade, or consider whether we need to disallow migrations while in the "transitioning" FCV state. While the example above uses movePrimary, as part of this ticket we should make sure such a bug isn't possible with our other migration methods, such as moveChunk or resharding.
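
      To make the "disallow migrations while transitioning" option concrete, here is a minimal sketch of such a guard in a toy model. It is an assumption about what the option could look like, not a proposal for the actual server change: `ToyShard`, `MigrationBlocked`, and `guarded_move_primary` are hypothetical names, and a real fix would also need to cover moveChunk and resharding.

```python
class MigrationBlocked(Exception):
    """Raised by the hypothetical guard when a shard is mid FCV transition."""


class ToyShard:
    def __init__(self, name, collections=()):
        self.name = name
        self.fcv_state = "upgraded"
        self.collections = set(collections)


def guarded_move_primary(src, dst, coll):
    # Hypothetical guard: refuse to move data while either side is in the
    # "transitioning" FCV state, so per-shard checks cannot be dodged.
    for shard in (src, dst):
        if shard.fcv_state == "transitioning":
            raise MigrationBlocked(f"shard {shard.name} is in the transitioning FCV state")
    src.collections.remove(coll)
    dst.collections.add(coll)


shard_a = ToyShard("A", collections={"db.problematic"})
shard_b = ToyShard("B")
shard_a.fcv_state = shard_b.fcv_state = "transitioning"

try:
    guarded_move_primary(shard_a, shard_b, "db.problematic")
    blocked = False
except MigrationBlocked:
    blocked = True

print(blocked, sorted(shard_a.collections))  # prints: True ['db.problematic']
```

      With the guard in place the collection stays on Shard A, so Shard A's own checks would still see it. The trade-off in this sketch is that migrations are rejected for the whole duration of the transition.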

      This bug was found when I added code to the _internalServerCleanupForDowngrade() function.

      SERVER-87297 appears to be a particular instance of this bug.

            Assignee:
            backlog-server-catalog-and-routing [DO NOT USE] Backlog - Catalog and Routing
            Reporter:
            vishnu.kaushik@mongodb.com Vishnu Kaushik
            Votes:
            0
            Watchers:
            7
