  1. Core Server
  2. SERVER-76546

_migrateClone can deadlock with prepared transactions on secondaries

    • Sharding NYC
    • Fully Compatible
    • ALL
    • v7.0, v6.0, v5.0
    • Sharding NYC 2023-05-29
    • 105
    • 5

      The _migrateClone command isn't supposed to run on secondaries, but once the command has been admitted we never check again whether the node is still primary. If the command is still running after the node has transitioned to secondary, the following deadlock can occur:

      • The command blocks on a prepared transaction while holding the PBWM (ParallelBatchWriterMode) lock.
      • Oplog batch application blocks because the PBWM lock is held.
      • Any attempt to commit or abort the prepared transaction is never processed, because oplog application is stuck.

      Note that this deadlock requires _migrateClone to start on a primary and then block on a prepared transaction after the node has stepped down to secondary.
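
      The cycle described above can be written down as a small wait-for graph. The following is only an illustrative sketch, not server code: `WaitForGraph`, `isDeadlocked`, and `migrateCloneOnSecondary` are hypothetical names, and each actor is assumed to wait on exactly one other actor.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical model: maps each blocked actor to the single actor it waits on.
using WaitForGraph = std::map<std::string, std::string>;

// Returns true if following wait-for edges from `start` leads back to `start`.
bool isDeadlocked(const WaitForGraph& graph, const std::string& start) {
    std::set<std::string> seen;
    std::string current = start;
    while (true) {
        auto it = graph.find(current);
        if (it == graph.end()) {
            return false;  // `current` is not waiting on anything.
        }
        current = it->second;
        if (current == start) {
            return true;  // Walked a full cycle back to the start.
        }
        if (!seen.insert(current).second) {
            return false;  // Hit a cycle that does not include `start`.
        }
    }
}

// The three waits from the ticket, as edges "waiter -> what it waits on".
WaitForGraph migrateCloneOnSecondary() {
    return {
        // Blocked on the prepared transaction while holding the PBWM lock.
        {"_migrateClone", "prepared transaction"},
        // Can only be ended once oplog application makes progress.
        {"prepared transaction", "oplog application"},
        // Blocked because the PBWM lock is held by the command.
        {"oplog application", "_migrateClone"},
    };
}
```

      Removing any one edge breaks the cycle, which is what both proposed fixes aim for: the first prevents the command from running (and waiting) on a secondary at all, the second removes the PBWM dependency.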

      The issue can most easily be solved in one of two ways:

      • After taking the AutoGetActiveCloner, check that the node is still primary and fail the _migrateClone command if it is not. This helper takes the AutoGetCollection, which in turn takes the RSTL (replication state transition lock), so we can guarantee the node stays primary while it is in scope.
      • If for some reason we want this command to proceed while secondary, we could use AutoGetCollectionForRead or AutoGetCollectionForReadLockFree, which both skip taking the PBWM lock.
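
      The first option amounts to "take the lock that blocks state transitions, then check the state". The sketch below illustrates that pattern with standard C++ only. It does not use the server's real classes: `ReplState`, `CloneResult`, and `migrateCloneChecked` are hypothetical, and a `std::shared_mutex` merely stands in for the RSTL.

```cpp
#include <mutex>
#include <shared_mutex>

// Hypothetical stand-in for the node's replication state. The shared_mutex
// plays the role of the RSTL: commands hold it in shared mode, while a
// state transition such as stepdown needs it exclusively.
class ReplState {
public:
    std::shared_lock<std::shared_mutex> lockShared() {
        return std::shared_lock<std::shared_mutex>(_rstl);
    }

    // Callers are expected to hold the lock returned by lockShared().
    bool isPrimary() const {
        return _primary;
    }

    // Waits for all shared holders to release before changing state.
    void stepDown() {
        std::unique_lock<std::shared_mutex> lk(_rstl);
        _primary = false;
    }

private:
    std::shared_mutex _rstl;
    bool _primary = true;
};

enum class CloneResult { kOk, kNotPrimary };

// Hypothetical command body: the primary check happens only after the lock
// is held, so the answer cannot go stale while `rstl` is in scope.
CloneResult migrateCloneChecked(ReplState& repl) {
    auto rstl = repl.lockShared();
    if (!repl.isPrimary()) {
        return CloneResult::kNotPrimary;  // Fail instead of blocking as a secondary.
    }
    // ... clone the next batch here; stepDown() cannot complete until we return.
    return CloneResult::kOk;
}
```

      The ordering is the point of the sketch: checking the state before acquiring the lock would leave a window in which the node could step down between the check and the work.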

            randolph@mongodb.com Randolph Tan
            louis.williams@mongodb.com Louis Williams