Core Server / SERVER-76834

Circular wait dependency between PBWM and ticket acquisition

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 7.1.0-rc0
    • Affects Version/s: 7.0.0-rc0, 6.0.5, 5.0.17, 4.4.21, 6.3.1
    • Component/s: Storage
    • Labels: None
    • Storage Execution
    • Fully Compatible
    • ALL
    • v7.0
    • Execution NAMR Team 2023-08-21
    • 107

      Any thread that takes both the PBWM (the ParallelBatchWriterMode lock) and a ticket can introduce a three-way deadlock in the presence of prepared transactions.

      The deadlock can be summarised like this:

      • T1 (Regular secondary read threads): N of them (where N == number of available tickets) acquire tickets, but then block waiting for a prepared transaction to commit. That commit is in a subsequent batch, so these threads will not be unblocked until oplog application makes forward progress.
      • T2 (Internal thread): Takes the PBWM in some intent mode (IS) and blocks waiting on a ticket acquisition.
      • T3 (Oplog application thread): Tries to take the PBWM in mode X, but blocks behind T2, which holds the PBWM.

      We now have a wait cycle of (T3) -> (T2) -> (T1) -> (T3).
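
The cycle above can be checked mechanically by modelling it as a wait-for graph. The following is a minimal sketch, not server code: the `WAIT_FOR` dictionary and the `find_cycle` helper are hypothetical names introduced for illustration, and an edge A -> B is assumed to mean "A is blocked until B makes progress".

```python
# Wait-for graph for the scenario described above (illustrative only).
# An edge A -> B means "A cannot proceed until B does".
WAIT_FOR = {
    "T3": ["T2"],  # oplog applier wants the PBWM in X; T2 holds it in IS
    "T2": ["T1"],  # internal thread wants a ticket; the T1 readers hold all N
    "T1": ["T3"],  # readers wait on a prepared txn that only T3 can commit
}


def find_cycle(graph, start):
    """Depth-first search from `start`; return the first cycle found, or None."""
    path = []        # nodes on the current DFS path, in order
    on_path = set()  # same nodes, for O(1) membership checks
    done = set()     # nodes fully explored without finding a cycle

    def visit(node):
        if node in on_path:
            # Reached a node already on the path: slice out the cycle.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        path.append(node)
        on_path.add(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        path.pop()
        on_path.remove(node)
        done.add(node)
        return None

    return visit(start)


print(" -> ".join(find_cycle(WAIT_FOR, "T3")))  # T3 -> T2 -> T1 -> T3

# If T2 never blocks on a ticket while holding the PBWM, its edge disappears
# and the cycle is broken.
without_t2_wait = {"T3": ["T2"], "T1": ["T3"]}
print(find_cycle(without_t2_wait, "T3"))  # None
```

Removing any single edge breaks the cycle; the sketch drops the T2 edge because that is the one a rule against blocking on a ticket while holding the PBWM would eliminate.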

      This demonstrates that it is not safe to block on a ticket acquisition while holding the PBWM and there are at least two places where this can happen:

      This (more general) ticket is for the StorEx team to tighten up the locking rules in order to prevent deadlocks like these from being introduced. One possibility is to add an invariant that we can't take both PBWM and a ticket - a thread must choose one.
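
One way to picture the proposed invariant is as per-thread bookkeeping that rejects the second acquisition. This is only a sketch under stated assumptions: `ThreadResources` and `InvariantViolation` are hypothetical names, not the server's locker or ticket-holder API, and the fix that actually landed may look quite different.

```python
class InvariantViolation(Exception):
    """Raised when a thread tries to hold the PBWM and a ticket together."""


class ThreadResources:
    """Hypothetical per-thread state; not the server's actual locker API."""

    def __init__(self):
        self.holds_pbwm = False
        self.holds_ticket = False

    def acquire_pbwm(self):
        # Invariant: a thread holds the PBWM or a ticket, never both.
        if self.holds_ticket:
            raise InvariantViolation(
                "cannot acquire the PBWM while holding a ticket")
        self.holds_pbwm = True

    def acquire_ticket(self):
        # Same invariant, checked from the other direction.
        if self.holds_pbwm:
            raise InvariantViolation(
                "cannot acquire a ticket while holding the PBWM")
        self.holds_ticket = True

    def release_pbwm(self):
        self.holds_pbwm = False

    def release_ticket(self):
        self.holds_ticket = False


# T2 from the description: takes the PBWM, then asks for a ticket.
t2 = ThreadResources()
t2.acquire_pbwm()
try:
    t2.acquire_ticket()
except InvariantViolation as exc:
    print(f"rejected: {exc}")
```

Failing fast at the second acquisition turns a potential deadlock into an immediate, debuggable error, which is the appeal of an invariant over relying on code review to catch each call site.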

            Assignee: Gregory Noma (gregory.noma@mongodb.com)
            Reporter: Kaloian Manassiev (kaloian.manassiev@mongodb.com)
            Votes: 0
            Watchers: 14
