Core Server / SERVER-76884

[v4.4] Chunk migration recovery can deadlock on stepup taking MigrationBlockingGuard (v4.4 only)

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 4.4.23
    • Affects Version/s: 4.4.0
    • Component/s: None
    • Labels:
    • Sharding EMEA
    • Fully Compatible
    • ALL
    • Sharding EMEA 2023-05-15
    • 31

      [Note: this can only happen on v4.4. v5.0 and later no longer take the MigrationBlockingGuard on stepup.]

      The deadlock is the following:
      (1) The OplogApplier thread is holding the RSTL lock in MODE_X for stepup and is blocked trying to acquire the MigrationBlockingGuard.
      (2) A chunk migration donor is still registered in the active migration registry, so (1) cannot acquire the MigrationBlockingGuard while that registration remains. The migration, in turn, is trying to perform a write on the local replica set but is blocked trying to acquire the RSTL lock, which is held by (1).
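
      The cycle described in (1) and (2) can be illustrated with a small, self-contained simulation. This is only a sketch and not MongoDB code: `simulate`, `rstl`, `registry_empty` and the other names are hypothetical stand-ins, the RSTL is modelled as a plain mutex, and the MigrationBlockingGuard is modelled as "wait until the migration registry is empty". Timeouts are used so the sketch reports the deadlock instead of hanging; the real threads would wait indefinitely.

```python
import threading


def simulate(take_guard_on_stepup=True, timeout=0.2):
    """Model the stepup/donor interaction with toy primitives (assumptions)."""
    rstl = threading.Lock()             # stand-in for the RSTL held in MODE_X
    registry_empty = threading.Event()  # set once the donor has unregistered
    rstl_held = threading.Event()       # stepup signals that it owns the RSTL
    donor_done = threading.Event()      # donor signals that it stopped trying
    result = {"stepup_got_guard": None, "donor_got_rstl": None}

    def stepup():
        with rstl:
            rstl_held.set()
            if take_guard_on_stepup:
                # Stand-in for acquiring the MigrationBlockingGuard: wait for
                # the active migration registry to drain.
                result["stepup_got_guard"] = registry_empty.wait(timeout)
                # Keep holding the RSTL until the donor gives up, so the
                # outcome of this sketch is deterministic.
                donor_done.wait()

    def donor():
        rstl_held.wait()
        # The donor needs the RSTL for its local write before it can finish
        # and unregister from the registry.
        got = rstl.acquire(timeout=timeout)
        result["donor_got_rstl"] = got
        if got:
            rstl.release()
            registry_empty.set()
        donor_done.set()

    threads = [threading.Thread(target=stepup), threading.Thread(target=donor)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Both sides timing out means each was waiting on the other.
    result["deadlocked"] = (
        result["stepup_got_guard"] is False and result["donor_got_rstl"] is False
    )
    return result


if __name__ == "__main__":
    print(simulate(take_guard_on_stepup=True))
    print(simulate(take_guard_on_stepup=False))
```

      Calling the sketch with `take_guard_on_stepup=False` loosely mirrors the note above about v5.0 and later: once stepup no longer waits on the guard while holding the RSTL, the donor can take the lock and finish.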

      The (rare) interleaving that triggers this deadlock may happen when a request the donor sent to the recipient takes a long time to be processed: long enough for the donor node to step down and then step back up (the same node).

        1. repro-bf-28646.patch
          2 kB
          Jordi Serra Torrens

            jordi.serra-torrens@mongodb.com Jordi Serra Torrens