Core Server / SERVER-15750

Deadlock cycle in replication among oplog producer, oplog application and replication executor threads

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 2.7.8
    • Affects Version/s: None
    • Component/s: Replication
    • Labels: None
    • Operating System: ALL

      What follows is a description of a deadlock cycle observed while running maintenance_non-blocking.js. In summary, the error is that one cannot wait on the replication executor while holding the bgsync mutex.

      The oplog application thread periodically calls tryToGoLiveAsSecondary(), which acquires the global lock in shared (S) mode, and then calls getMaintenanceMode() on the replication coordinator, which schedules and waits for a callback on the replication executor.

      The oplog producer thread locks the bgsync mutex (BackgroundSync::_mutex), and then tries to acquire the global lock in intent exclusive (IX) mode. IX conflicts with S, so the producer blocks behind the oplog application thread.
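The blocking described here follows from the usual multi-granularity lock compatibility rules, under which S and IX conflict. The following is a minimal sketch that assumes the standard IS/IX/S/X compatibility matrix; it is not the server's actual lock manager, and `compatible` and `can_grant` are hypothetical names used only for illustration.

```python
# Sketch of the standard multi-granularity lock compatibility matrix.
# Assumption: textbook IS/IX/S/X semantics, not the server's lock manager.

# Unordered pairs of modes that may be held at the same time.
# A single-element frozenset means "this mode is compatible with itself".
_COMPATIBLE_PAIRS = {
    frozenset(["IS"]),
    frozenset(["IS", "IX"]),
    frozenset(["IS", "S"]),
    frozenset(["IX"]),
    frozenset(["S"]),
}


def compatible(a, b):
    """Return True if modes a and b can be held concurrently."""
    return frozenset([a, b]) in _COMPATIBLE_PAIRS


def can_grant(requested, held):
    """Return True if `requested` is compatible with every mode in `held`."""
    return all(compatible(requested, mode) for mode in held)


# The oplog application thread holds the global lock in S mode, so the
# oplog producer's IX request cannot be granted until S is released.
producer_can_proceed = can_grant("IX", ["S"])
```

With the global lock held in S mode, `producer_can_proceed` is False, which is why the producer sits in the lock queue while still holding the bgsync mutex.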

      A third thread runs setMaintenanceMode, which blocks in the replication executor trying to clear the sync source in the producer thread, which requires the bgsync mutex.

      So, the executor is blocked in setMaintenanceModeHelper waiting for the bgsync mutex. The bgsync mutex is held by the oplog producer, which is waiting for the global lock in IX mode. That request is blocked by the oplog application thread, which holds the global lock in S mode and is waiting for a callback to run through the replication executor. This closes the cycle.
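The cycle can be checked mechanically by writing it down as a wait-for graph. The sketch below is illustrative only: the node names and the `find_cycle` helper are hypothetical, and each edge simply restates one of the dependencies described in this report.

```python
# Wait-for graph for the reported deadlock.
# Each entry reads: waiter -> (party it waits on, resource involved).
# Assumption: every waiter waits on exactly one other party.
WAITS_FOR = {
    # Executor runs setMaintenanceModeHelper and needs the bgsync mutex.
    "replication executor": ("oplog producer", "BackgroundSync::_mutex"),
    # Producer holds the bgsync mutex and wants the global lock in IX mode.
    "oplog producer": ("oplog application", "global lock (IX behind S)"),
    # Applier holds the global lock in S mode and waits for its
    # getMaintenanceMode() callback to run on the executor.
    "oplog application": ("replication executor", "executor callback"),
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from `start`; return the cycle as a list of
    nodes, or None if the chain ends without revisiting a node."""
    path = []
    position = {}
    node = start
    while node is not None:
        if node in position:
            return path[position[node]:]
        position[node] = len(path)
        path.append(node)
        edge = waits_for.get(node)
        node = edge[0] if edge is not None else None
    return None


cycle = find_cycle(WAITS_FOR, "replication executor")

# Removing any one edge breaks the cycle. For example, if the applier did
# not wait on the executor while holding the global lock:
without_callback_wait = {
    k: v for k, v in WAITS_FOR.items() if k != "oplog application"
}
no_cycle = find_cycle(without_callback_wait, "replication executor")
```

Here `cycle` contains all three parties and `no_cycle` is None. The thread that calls setMaintenanceMode is left out of the graph because it only waits on the executor and holds nothing the others need.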

            Assignee: Andy Schwerin (schwerin@mongodb.com)
            Reporter: Andy Schwerin (schwerin@mongodb.com)