  1. Core Server
  2. SERVER-38660

Fuzzer can cause tests to time out when executing secondary reads with transactions

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: 4.1.6
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Operating System: ALL
    • Sprint: Repl 2019-01-28

      Description

      Once a read-only transaction on a secondary starts and stashes its locks, the replication applier blocks when it tries to acquire the global X lock (for example, to replicate a create collection command). Once the X lock request is queued, new requests to the secondary queue behind it until it is granted or abandoned. This can lead to a deadlock such as the following:

      1. Txn1 starts, performs its operations, and stashes its locks.
      2. The repl applier requests the global X lock, which conflicts with the stashed locks, so the lock request is queued.
      3. Txn1 continues: it checks out the session and tries to satisfy the read concern. This involves checking whether the oplog collection exists, which requires a global IS lock, so the request queues behind #2 (note: this happens before the locks are unstashed).
      4. The Periodic Txn Killer sees that Txn1 has already expired and tries to kill it by checking out the session, but it blocks waiting for step #3 to check the session back in.
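
The queueing behavior in steps 1 to 3 can be illustrated with a toy model. This is only a sketch under stated assumptions: it models a single lock with a strict FIFO wait queue, and the class `FifoLock` and the owner labels (`txn1_stashed`, `applier`, `txn1_opctx`) are hypothetical names invented for illustration, not MongoDB code or its actual lock manager policy.

```python
from collections import deque

# Compatibility of the two modes in the scenario: IS (intent shared) and X (exclusive).
COMPATIBLE = {
    ("IS", "IS"): True,
    ("IS", "X"): False,
    ("X", "IS"): False,
    ("X", "X"): False,
}


class FifoLock:
    """Toy single lock with a strict FIFO wait queue (hypothetical model)."""

    def __init__(self):
        self.granted = []       # (owner, mode) pairs currently holding the lock
        self.waiting = deque()  # queued (owner, mode) requests in arrival order

    def _compatible_with_holders(self, mode):
        return all(COMPATIBLE[(held, mode)] for _, held in self.granted)

    def request(self, owner, mode):
        """Return True if granted immediately, False if the request was queued."""
        # A new request may not jump ahead of requests that are already waiting.
        if self._compatible_with_holders(mode) and not self.waiting:
            self.granted.append((owner, mode))
            return True
        self.waiting.append((owner, mode))
        return False

    def release(self, owner):
        """Drop the owner's locks, then grant queued requests in order."""
        self.granted = [(o, m) for o, m in self.granted if o != owner]
        while self.waiting and self._compatible_with_holders(self.waiting[0][1]):
            self.granted.append(self.waiting.popleft())


lock = FifoLock()
step1 = lock.request("txn1_stashed", "IS")  # step 1: granted, then stashed
step2 = lock.request("applier", "X")        # step 2: conflicts with the stashed IS, queued
step3 = lock.request("txn1_opctx", "IS")    # step 3: compatible with IS, but queued behind X
print(step1, step2, step3, list(lock.waiting))
```

In this model the third request is compatible with the lock that is already held, yet it still waits because the X request is ahead of it in the queue. The stashed lock can only be released once Txn1 makes progress or is aborted, and step 4 explains why neither happens.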

      More notes:
      The Periodic Txn Killer actually kills the session's opCtx before trying to check the session out, but step #3 is also blocked on the PBWM resource mutex while trying to acquire the GlobalLock. That wait does not use the opCtx, so killOp cannot interrupt it.
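
The difference between an interruptible wait and one that ignores the kill signal can be sketched as follows. This is an analogy using Python's `threading` module, not the server's implementation: `interruptible_acquire`, the `Interrupted` exception, and the `killed` event are hypothetical stand-ins for a wait that consults the opCtx, and the plain `acquire` call stands in for the wait on the PBWM mutex.

```python
import threading


class Interrupted(Exception):
    """Raised when a wait is abandoned because the operation was killed."""


def interruptible_acquire(lock, killed, poll_seconds=0.01):
    """Wait for the lock, but give up once the kill flag is set."""
    while not lock.acquire(timeout=poll_seconds):
        if killed.is_set():
            raise Interrupted("operation was killed while waiting")
    return True


# The lock is held by someone else (the queued applier, in the report's scenario).
held = threading.Lock()
held.acquire()

# The killer marks the operation as killed.
killed = threading.Event()
killed.set()

# A wait that checks the kill flag stops waiting.
try:
    interruptible_acquire(held, killed)
    was_interrupted = False
except Interrupted:
    was_interrupted = True

# A plain acquire never looks at the kill flag. The timeout here only keeps
# the example from hanging; without it the call would block indefinitely.
plain_result = held.acquire(timeout=0.05)

print(was_interrupted, plain_result)
```

The second wait is the situation described in the notes: setting the kill flag has no effect on a waiter that never checks it, so the session is never checked back in.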
