  1. Core Server
  2. SERVER-70873

Stepdown during drop collection can lead to a deadlock

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 6.2.0-rc0
    • Affects Version/s: 6.2.0-rc0
    • Component/s: Sharding
    • Labels:
      None
    • Fully Compatible
    • ALL
    • Sharding EMEA 2022-10-31, Sharding EMEA 2022-11-14
    • 135

      As part of SERVER-65016, new code that removes the range deletion documents was added to the existing drop collection code as an optimization. This code uses an alternative client region to remove multiple documents, because shardsvr_drop_collection_participant uses the retryable write machinery to guard against replays.

      The unintended effect is that a thread dropping a collection first checks out a session and then, as part of taking the collection lock, tries to acquire the RSTL when it executes the DBClient command that removes the range deletion documents. If a stepdown sneaks in after the session is checked out, the stepdown thread acquires the RSTL and then tries to check out and kill all running sessions, causing a deadlock: the drop thread holds the session and waits for the RSTL, while the stepdown thread holds the RSTL and waits for the session.
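The circular wait described above can be illustrated with a small wait-for check. This is a toy model only, not server code: the function `find_deadlock`, the thread names, and the resource names `session` and `RSTL` are hypothetical labels chosen to mirror the description.

```python
def find_deadlock(holds, waits_for):
    """Return the cycle of threads waiting on each other, or [] if none.

    holds:     thread name -> set of resources it currently owns
    waits_for: thread name -> the single resource it is blocked on
    """
    # Map each resource to the thread that owns it.
    owner = {res: t for t, resources in holds.items() for res in resources}

    for start in waits_for:
        path = []
        current = start
        while current in waits_for and current not in path:
            path.append(current)
            # Follow the edge: blocked thread -> owner of the wanted resource.
            current = owner.get(waits_for[current])
        if current == start:
            return path
    return []


# State after the stepdown sneaks in (hypothetical names).
holds = {"drop_collection": {"session"}, "stepdown": {"RSTL"}}
waits_for = {"drop_collection": "RSTL", "stepdown": "session"}

cycle = find_deadlock(holds, waits_for)
```

With this state, each thread waits on a resource owned by the other, so `cycle` contains both threads. If the drop thread were not blocked on the RSTL, the stepdown thread would merely wait for the session and no cycle would be reported.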

      In the attached stack trace log, this situation can be seen between Thread 2 and Thread 99. One way to solve this is to create the operation context the same way the rename collection metadata command does: link the new operation context created in the alternative client region to the parent's cancellation token. That way, when the stepdown interrupts the parent operation context, the thread waiting for the lock stops waiting and releases the session, allowing the stepdown thread to check it out.
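The idea behind the proposed fix can be sketched with plain Python threads. This is a simplified model under stated assumptions, not the server implementation: `CancellationToken`, `acquire_interruptibly`, and `run_stepdown_scenario` are hypothetical names, two `threading.Lock` objects stand in for the session and the RSTL, and the lock wait polls the token where a real implementation would presumably interrupt the wait directly.

```python
import threading


class Interrupted(Exception):
    """Raised when a wait is abandoned because its token was cancelled."""


class CancellationToken:
    def __init__(self):
        self._event = threading.Event()

    def cancel(self):
        self._event.set()

    def is_cancelled(self):
        return self._event.is_set()


def acquire_interruptibly(lock, token, poll_seconds=0.01):
    """Acquire `lock`, giving up with Interrupted once `token` is cancelled."""
    while not lock.acquire(timeout=poll_seconds):
        if token.is_cancelled():
            raise Interrupted()


def run_stepdown_scenario():
    session = threading.Lock()  # stands in for the checked-out session
    rstl = threading.Lock()     # stands in for the RSTL
    parent_token = CancellationToken()
    session_checked_out = threading.Event()
    rstl_held_by_stepdown = threading.Event()
    outcome = {}

    def drop_collection():
        with session:  # the session is checked out first
            session_checked_out.set()
            rstl_held_by_stepdown.wait()
            try:
                # The nested wait is linked to the parent's token.
                acquire_interruptibly(rstl, parent_token)
                rstl.release()
                outcome["drop"] = "completed"
            except Interrupted:
                outcome["drop"] = "interrupted"
        # Leaving the `with` block releases the session.

    worker = threading.Thread(target=drop_collection)
    worker.start()

    # Stepdown: take the RSTL, interrupt the parent, then check out the session.
    session_checked_out.wait()
    rstl.acquire()
    rstl_held_by_stepdown.set()
    parent_token.cancel()
    outcome["stepdown_got_session"] = session.acquire(timeout=5)

    worker.join()
    if outcome["stepdown_got_session"]:
        session.release()
    rstl.release()
    return outcome


result = run_stepdown_scenario()
```

Because the wait on the RSTL observes the parent's cancellation, the drop thread gives up and releases the session, and the stepdown side can then acquire it. Without the link to the parent token, the drop thread would keep waiting for the RSTL while still holding the session.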

            Assignee:
            marcos.grillo@mongodb.com Marcos José Grillo Ramirez
            Reporter:
            marcos.grillo@mongodb.com Marcos José Grillo Ramirez
            Votes:
            0
            Watchers:
            3