Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.4.0-rc7, 4.7.0
Affects Version/s: None
Component/s: Storage
Labels: None

Backwards Compatibility: Fully Compatible
Backport Requested: v4.4
Sprint: Execution Team 2020-06-01
Linked BF Score: 22
CAR Domain/s: None

Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Replication is currently hitting a deadlock in which the LastVote operation runs under an uninterruptible lock guard. The two threads block each other as follows:

LastVote
    waitUntilDurable
        _lastSyncMutex
            getToken
                refreshIfPrimary
                    AutoGetCollection (stuck on RSTL IX)

Stepdown (RSTL X)
    waitUntilDurable
        _lastSyncMutex (stuck on LastVote's acquisition)

Regardless of the uninterruptible lock guard, it is bad practice to take a mutex and then acquire locks (here, the RSTL) while still holding it.
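The lock-order inversion in the two traces above can be reproduced in miniature. This is only a sketch, not server code: `rstl` is a `std::shared_mutex` standing in for the RSTL (shared mode playing the role of IX, exclusive mode the role of X), `lastSyncMutex` stands in for `_lastSyncMutex`, and `probeLockOrderInversion` and `InversionResult` are hypothetical names. It uses `try_lock` so that it reports the conflict instead of actually hanging.

```cpp
#include <cassert>
#include <future>
#include <mutex>
#include <shared_mutex>
#include <thread>

// Outcome of each side's second acquisition attempt.
struct InversionResult {
    bool lastVoteGotRstl;   // LastVote path: mutex first, then RSTL (shared)
    bool stepdownGotMutex;  // Stepdown path: RSTL (exclusive) first, then mutex
};

InversionResult probeLockOrderInversion() {
    std::shared_mutex rstl;    // stand-in for the RSTL (assumption)
    std::mutex lastSyncMutex;  // stand-in for _lastSyncMutex (assumption)
    std::promise<void> mutexHeld;
    std::promise<void> stepdownDone;
    std::promise<bool> lastVoteResult;

    // Stepdown side: take the RSTL exclusively first.
    std::unique_lock<std::shared_mutex> stepdownRstl(rstl);

    // LastVote side: take the mutex first, then try for the RSTL.
    std::thread lastVote([&] {
        std::lock_guard<std::mutex> lk(lastSyncMutex);
        mutexHeld.set_value();
        bool got = rstl.try_lock_shared();  // fails: stepdown holds it exclusively
        if (got) {
            rstl.unlock_shared();
        }
        lastVoteResult.set_value(got);
        // Keep holding the mutex until the stepdown side has made its attempt.
        stepdownDone.get_future().wait();
    });

    // Stepdown side: now try for the mutex that LastVote is holding.
    mutexHeld.get_future().wait();
    bool gotMutex = lastSyncMutex.try_lock();  // fails: LastVote holds it
    if (gotMutex) {
        lastSyncMutex.unlock();
    }
    bool gotRstl = lastVoteResult.get_future().get();

    stepdownDone.set_value();
    lastVote.join();
    return InversionResult{gotRstl, gotMutex};
}
```

Both fields come back `false`: each side holds what the other needs. With blocking acquisitions in place of `try_lock`, neither thread could make progress.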

I recommend one of the following two options (I prefer #2):

1) waitUntilDurable should not call getToken under the mutex. Instead, it should make the getToken call, which takes locks, first and only then take the mutex.
I think this is workable, because getToken (which calls refreshOplogTruncateAfterPointIfPrimary) has its own mutex that makes reading from the oplog and updating the oplogTruncateAfterPoint atomic. waitUntilDurable then calls onDurable with the result of getToken after the journal flush, and replication already has protections against going backwards in time.

2) Remove waitUntilDurable's _lastSyncMutex, because it is no longer needed for performance. I haven't verified that performance will be unaffected, but from reading the code I believe it won't be. We've recently moved writeConcern's waitUntilDurable calls onto the async JournalFlusher thread, which has its own caller batching. Without the write callers, the only other waitUntilDurable callers are one-offs for uncommon replication events. Batching doesn't seem to need to continue in the waitUntilDurable code layer.
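As a rough illustration of option 1, the sketch below obtains the token before taking the flush mutex and relies on a monotonic guard in `onDurable` so that a late, smaller token cannot move the durable point backwards. Everything here is hypothetical: `DurabilitySketch`, the integer token, and the counter used in place of a journal flush are illustrative stand-ins, not the storage engine's actual interfaces.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <mutex>

class DurabilitySketch {
public:
    // Stand-in for getToken: it has its own mutex for atomicity and, in the
    // real system, may take locks. It must not be called under _lastSyncMutex.
    std::uint64_t getToken() {
        std::lock_guard<std::mutex> lk(_tokenMutex);
        return ++_nextToken;
    }

    void waitUntilDurable() {
        // Option 1: get the token first, with no mutex held.
        const std::uint64_t token = getToken();

        // Only now take the mutex that serializes flushes.
        std::lock_guard<std::mutex> lk(_lastSyncMutex);
        ++_flushCount;     // placeholder for the journal flush
        onDurable(token);  // report the token after the flush
    }

    // Monotonic guard: a stale token never moves the durable point backwards.
    void onDurable(std::uint64_t token) {
        std::lock_guard<std::mutex> lk(_durableMutex);
        _durable = std::max(_durable, token);
    }

    std::uint64_t durable() {
        std::lock_guard<std::mutex> lk(_durableMutex);
        return _durable;
    }

private:
    std::mutex _tokenMutex;
    std::mutex _lastSyncMutex;
    std::mutex _durableMutex;
    std::uint64_t _nextToken = 0;
    std::uint64_t _durable = 0;
    std::uint64_t _flushCount = 0;
};
```

Because tokens are taken outside `_lastSyncMutex`, two callers may reach `onDurable` in a different order than they obtained their tokens. The `std::max` guard models the existing protection against going backwards in time that the description relies on.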

is duplicated by

SERVER-48125 Stepdown can deadlock with storing lastVote via journal flusher

Closed

is related to

SERVER-48143 Remove support for 'fsync' flag in writeConcern specifications leftover from mmap

Open

related to

SERVER-80432 waitForSession can miss JournalListener onDurable calls

Closed

Assignee: Dianna Hohensee (Inactive)
Reporter: Dianna Hohensee (Inactive)
Participants: Dianna Hohensee, Githook User, Lingzhi Deng
Votes: 0
Watchers: 6

Created: May 12 2020 04:53:46 PM UTC
Updated: Oct 29 2023 10:08:20 PM UTC
Resolved: May 20 2020 09:09:19 PM UTC
