  1. Core Server
  2. SERVER-70949

Condition variable wait might lead to liveness failure and then crash a node

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 6.2.0-rc0
    • Component/s: Sharding
    • ALL
    • Sharding EMEA 2022-11-14
    • 135

      SERVER-69886 added code to prevent a node's shutdown from being delayed while a range deletion task was ongoing. As part of these changes, a new object, ReadyRangeDeletionsProcessor, was added whose sole purpose is to process the enqueued range deletions.

      However, the constructor of this object waits on the queue's condition variable (in addition to the internal wait of the processor's thread), and the object gets created in the onStepUpComplete callback. This is a problem because the callback is invoked by the ReplicationCoordinator while it holds the RSTL lock. If there are no enqueued range deletion tasks to unblock this thread, the constructor never returns and the RSTL lock is never released, so a future stepDown can never succeed because it can never acquire the RSTL lock. And since the object is never reset (which would be another way of waking the thread), the process will never serve operations and will eventually crash on the next stepDown. The attached stacktrace shows this situation: in Thread 1 a stepDown thread crashes the node, and in Thread 62 the stepUp thread is holding the RSTL lock.

      There shouldn't be a wait in the ReadyRangeDeletionsProcessor's constructor. However, some extra investigation is needed to ensure that removing it preserves the guarantees the object currently provides.
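
      To make the suggestion concrete, here is a minimal standalone sketch of a processor whose constructor does not wait. It is not the server's code: SketchRangeDeletionsProcessor, stepUpThenStepDown and the plain std::mutex standing in for the RSTL lock are all hypothetical names chosen for illustration. The only wait on the condition variable lives in the worker thread, and the destructor wakes that thread, so the caller that constructs the object under a lock returns immediately even when the queue is empty.

```cpp
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical stand-in for the processor described in this ticket.
class SketchRangeDeletionsProcessor {
public:
    // The constructor only starts the worker thread. It never waits on the
    // condition variable, so it cannot block a caller that holds a lock.
    SketchRangeDeletionsProcessor() : _worker([this] { _run(); }) {}

    // Destroying (or resetting) the object wakes the worker and joins it.
    ~SketchRangeDeletionsProcessor() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _shutdown = true;
        }
        _cv.notify_all();
        _worker.join();
    }

    void enqueue(int taskId) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _queue.push(taskId);
        }
        _cv.notify_one();
    }

    int processedCount() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _processed;
    }

private:
    // The only wait on the condition variable happens here, in the worker.
    void _run() {
        std::unique_lock<std::mutex> lk(_mutex);
        while (true) {
            _cv.wait(lk, [this] { return _shutdown || !_queue.empty(); });
            if (_shutdown) {
                return;
            }
            _queue.pop();  // a real processor would run the range deletion here
            ++_processed;
        }
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    std::queue<int> _queue;
    bool _shutdown = false;
    int _processed = 0;
    std::thread _worker;  // declared last so the members above are initialized first
};

// Simulates a step-up that builds the processor under a lock, followed by a
// step-down that needs the same lock. Returns true once both have completed.
bool stepUpThenStepDown() {
    std::mutex rstl;  // assumption: a plain mutex standing in for the RSTL lock
    std::unique_ptr<SketchRangeDeletionsProcessor> processor;
    {
        std::lock_guard<std::mutex> stepUp(rstl);
        // Returns at once even though no range deletion task is enqueued.
        processor = std::make_unique<SketchRangeDeletionsProcessor>();
        assert(processor->processedCount() == 0);
    }
    std::lock_guard<std::mutex> stepDown(rstl);  // acquirable: step-up released it
    processor.reset();                           // wakes and joins the worker
    return true;
}
```

      If the constructor in this sketch waited for a non-empty queue instead, the step-up scope would never exit with an empty queue and the step-down lock acquisition would hang, which mirrors the behaviour reported above. Whether the real object can drop the wait without losing its current guarantees is still the open question.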

            Assignee:
            tommaso.tocci@mongodb.com Tommaso Tocci
            Reporter:
            marcos.grillo@mongodb.com Marcos José Grillo Ramirez
            Votes:
            0
            Watchers:
            2

              Created:
              Updated:
              Resolved: