Core Server / SERVER-48144

waitUntilDurable should not take a mutex before taking locks

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4.0-rc7, 4.7.0
    • Component/s: Storage
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Backport Requested: v4.4
    • Sprint: Execution Team 2020-06-01
    • Linked BF Score: 22

      Description

      Replication is currently hitting the following deadlock, in which the LastVote operation is using an uninterruptible lock guard:

      LastVote
          waitUntilDurable
              _lastSyncMutex
                  getToken
                      refreshIfPrimary
                          AutoGetCollection (stuck on RSTL IX)
      

      Stepdown (RSTL X)
          waitUntilDurable
              _lastSyncMutex (stuck on LastVote's acquisition)
      

      Regardless of the uninterruptible lock guard, it is bad practice to take a mutex and then acquire locks while holding it.
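The inversion in the two call stacks above can be reproduced in miniature. The following is only a sketch under stated assumptions: `runLockOrderDemo`, `LockOrderResult`, and the use of `std::shared_timed_mutex` to stand in for the RSTL (shared for IX, exclusive for X) are hypothetical and are not the server's actual types. Timeouts replace the real uninterruptible waits so that the demo terminates instead of hanging.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <shared_mutex>
#include <thread>

// Hypothetical stand-ins for illustration; these are not MongoDB types.
struct LockOrderResult {
    bool lastVoteGotRstl = false;   // did the "LastVote" thread get RSTL IX?
    bool stepdownGotMutex = false;  // did the "stepdown" thread get the mutex?
};

LockOrderResult runLockOrderDemo() {
    using namespace std::chrono_literals;
    std::shared_timed_mutex rstl;    // models the RSTL: shared ~ IX, exclusive ~ X
    std::timed_mutex lastSyncMutex;  // models _lastSyncMutex
    std::atomic<int> holding{0};     // threads holding their first resource
    std::atomic<int> attempted{0};   // threads that finished their second attempt
    LockOrderResult result;

    std::thread lastVote([&] {
        std::lock_guard<std::timed_mutex> lk(lastSyncMutex);  // mutex first...
        ++holding;
        while (holding.load() < 2) std::this_thread::yield();
        // ...then a lock. The timeout stands in for the uninterruptible wait.
        if (rstl.try_lock_shared_for(50ms)) {
            result.lastVoteGotRstl = true;
            rstl.unlock_shared();
        }
        ++attempted;
        // Keep holding the mutex until the other thread has also given up.
        while (attempted.load() < 2) std::this_thread::yield();
    });

    std::thread stepdown([&] {
        std::lock_guard<std::shared_timed_mutex> x(rstl);  // RSTL X first...
        ++holding;
        while (holding.load() < 2) std::this_thread::yield();
        if (lastSyncMutex.try_lock_for(50ms)) {  // ...then the mutex
            result.stepdownGotMutex = true;
            lastSyncMutex.unlock();
        }
        ++attempted;
        while (attempted.load() < 2) std::this_thread::yield();
    });

    lastVote.join();
    stepdown.join();
    assert(attempted.load() == 2);  // both threads made their attempt
    return result;
}
```

Neither thread obtains its second resource in this sketch; without the timeouts, both would wait forever, which is the shape of the deadlock described above.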

      I recommend one of the following two options (I prefer #2):

      1) waitUntilDurable should not call getToken under the mutex; instead, it should take the mutex only after the getToken call, which takes locks, has returned.
      I think this is workable because getToken (which calls refreshOplogTruncateAfterPointIfPrimary) has its own mutex to make reading from the oplog and updating the oplogTruncateAfterPoint atomic. waitUntilDurable then calls onDurable with the result of getToken after the journal flush, but replication already has protections against going backwards in time.

      2) Remove waitUntilDurable's _lastSyncMutex, because it is no longer needed for performance. I haven't verified that performance will be unaffected, but from reading the code I believe it will be. We recently moved writeConcern's waitUntilDurable calls onto the async JournalFlusher thread, which has its own caller batching. Without the write callers, the only remaining waitUntilDurable callers are one-offs for uncommon replication events, so batching no longer seems necessary in the waitUntilDurable code layer.
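Option 1 can be sketched roughly as follows. This is a minimal illustration, not the server's code: the `DurabilityWaiter` class, the callback-based signature, and the `holdsMutex` probe are all hypothetical, and it is an assumption that the journal flush and the onDurable call are what remain under the mutex. The point is only the ordering: the lock-taking getToken call completes before the mutex is acquired.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>

// Hypothetical sketch of option 1; names and signatures are illustrative only.
class DurabilityWaiter {
public:
    using Token = std::uint64_t;

    void waitUntilDurable(const std::function<Token()>& getToken,
                          const std::function<void()>& flushJournal,
                          const std::function<void(Token)>& onDurable) {
        assert(getToken && flushJournal && onDurable);

        // Step 1: getToken may take locks, so it runs before the mutex
        // is acquired.
        const Token token = getToken();

        // Step 2: only now take the mutex, around work that takes no locks
        // (an assumption of this sketch).
        std::lock_guard<std::mutex> lk(_lastSyncMutex);
        _holdsMutex.store(true);
        flushJournal();
        onDurable(token);
        _holdsMutex.store(false);
    }

    // Sketch-only probe: true while waitUntilDurable holds _lastSyncMutex.
    bool holdsMutex() const { return _holdsMutex.load(); }

private:
    std::mutex _lastSyncMutex;
    std::atomic<bool> _holdsMutex{false};
};
```

With this ordering, a caller blocked inside getToken on a lock never holds the mutex, so another waitUntilDurable caller that already holds that lock is not stuck behind it.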

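For option 2, the relevant idea is that caller batching can live in the flusher layer rather than in waitUntilDurable. The sketch below is a generic illustration of that idea and makes no claim about how the JournalFlusher is actually implemented: `BatchingFlusher`, `requestFlush`, and `flushOnce` are hypothetical names, and a counter stands in for the journal flush.

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <mutex>
#include <vector>

// Hypothetical sketch of caller batching in a flusher; not the JournalFlusher.
class BatchingFlusher {
public:
    // Sketch-only sanity check: every registered waiter should have been served.
    ~BatchingFlusher() { assert(_waiters.empty()); }

    // Callers register interest in the next flush and wait on the future.
    std::future<void> requestFlush() {
        std::lock_guard<std::mutex> lk(_mutex);
        _waiters.emplace_back();
        return _waiters.back().get_future();
    }

    // Run by the flusher thread each round: one flush serves every caller
    // that registered before the round started.
    void flushOnce() {
        std::vector<std::promise<void>> batch;
        {
            std::lock_guard<std::mutex> lk(_mutex);
            batch.swap(_waiters);
        }
        ++_flushCount;  // stands in for the journal flush; no mutex held here
        for (auto& waiter : batch) {
            waiter.set_value();
        }
    }

    int flushCount() const { return _flushCount.load(); }

private:
    std::mutex _mutex;  // protects only the waiter list
    std::vector<std::promise<void>> _waiters;
    std::atomic<int> _flushCount{0};
};
```

In this arrangement the mutex guards only the waiter list and is released before the flush, so no caller waits on it while any other lock could be needed.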
              People

              Assignee:
              dianna.hohensee Dianna Hohensee
              Reporter:
              dianna.hohensee Dianna Hohensee
              Votes:
              0
              Watchers:
              6
