A DDL lock can be acquired when the DDL service state is not PrimaryAndRecovered


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 8.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • None
    • Catalog and Routing
    • Fully Compatible
    • ALL
    • CAR Team 2024-04-01, CAR Team 2024-04-15
    • 9

      A DDL lock should not be acquired when the DDL service state is not kPrimaryAndRecovered.

      Small explanation
      A DDL operation that is interrupted by a step-down and has to release its DDL lock must be the next operation to acquire that lock after the node steps up again. Therefore, no other operation may take the DDL lock right after a DDL operation is interrupted.

      The following sequence of events leads CheckMetadataConsistency to acquire the DDL lock in the middle of a Resharding operation:

      1. Resharding operation starts and acquires the DDL lock.
      2. CheckMetadataConsistency starts and gets blocked waiting for the DDL lock.
      3. Stepdown starts.
      4. Stepdown thread stops all the PrimaryOnlyServices.
      5. Resharding operation releases its DDL lock because it's a PrimaryOnlyService.
      6. CheckMetadataConsistency acquires the DDL lock.
      7. Stepdown thread kills all the interruptible opCtx. This kills the CheckMetadataConsistency operation, but possibly too late: the operation may have already finished.

      Suggested solution
      We must ensure the _state is still kPrimaryAndRecovered once the lock is acquired.
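
      The suggested check can be illustrated with a minimal, self-contained sketch. This is not the server's actual lock manager: `DdlLockManagerSketch`, `State::kPaused`, `setState` and `waiterAcquiresAfterStepDown` are hypothetical names invented for this example, and a plain `std::condition_variable` stands in for the real locking machinery. The only point it shows is that a waiter re-checks `_state` after waking up and before taking ownership of the lock.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical, simplified model of the DDL service state.
enum class State { kPaused, kPrimaryAndRecovered };

// Hypothetical stand-in for a DDL lock manager guarding a single resource.
class DdlLockManagerSketch {
public:
    void setState(State s) {
        std::lock_guard<std::mutex> lk(_mutex);
        _state = s;
        _cv.notify_all();  // wake waiters so they can observe the new state
    }

    // Blocks until the lock is free or the state changes, then re-checks
    // the state before handing out the lock.
    void lock(const std::string& holder) {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [this] {
            return _holder.empty() || _state != State::kPrimaryAndRecovered;
        });
        if (_state != State::kPrimaryAndRecovered) {
            throw std::runtime_error("DDL service is not primary and recovered");
        }
        _holder = holder;
    }

    void unlock() {
        std::lock_guard<std::mutex> lk(_mutex);
        _holder.clear();
        _cv.notify_all();
    }

    std::string holder() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _holder;
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    State _state = State::kPrimaryAndRecovered;
    std::string _holder;
};

// Replays steps 1-6 of the ticket and reports whether the waiter got the lock.
bool waiterAcquiresAfterStepDown() {
    DdlLockManagerSketch mgr;
    mgr.lock("resharding");  // step 1
    assert(mgr.holder() == "resharding");

    bool acquired = false;
    std::thread waiter([&] {  // step 2: blocks behind the resharding lock
        try {
            mgr.lock("checkMetadataConsistency");
            acquired = true;
            mgr.unlock();
        } catch (const std::runtime_error&) {
            // Rejected: the state was no longer kPrimaryAndRecovered.
        }
    });

    mgr.setState(State::kPaused);  // steps 3-4: step-down pauses the service
    mgr.unlock();                  // step 5: resharding releases its lock
    waiter.join();                 // step 6: the waiter must not get the lock
    return acquired;
}
```

      In this sketch the waiter is rejected regardless of thread scheduling, because the state is set to `kPaused` before the lock is released and every acquisition path checks the state while holding the mutex.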

            Assignee:
            Silvia Surroca
            Reporter:
            Silvia Surroca
            Votes:
            0
            Watchers:
            3
