[SERVER-48125] Stepdown can deadlock with storing lastVote via journal flusher Created: 12/May/20  Updated: 06/Dec/22  Resolved: 20/May/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Vesselina Ratcheva (Inactive) Assignee: Backlog - Replication Team
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
duplicates SERVER-48144 waitUntilDurable should not take a mu... Closed
Problem/Incident
is caused by SERVER-47612 Elections not robust in remove_newly_... Closed
Related
Assigned Teams:
Replication
Operating System: ALL
Participants:
Linked BF Score: 25

 Description   

As part of storing the lastVote document, we will wait for it to be durable, and we will eventually call into refreshOplogTruncateAfterPointIfPrimary. This needs to acquire a global IX lock as part of an AutoGetCollection. This can deadlock with stepdown, as it tries to clear the oplog truncate after point, which in turn waits on a journal flush. The journal flusher needs to be able to run wait for durability too, but it cannot get to the critical section as that is protected by a mutex which is already held by the lastVote thread.

We recently made storing the lastVote document fully uninterruptible in SERVER-47612.



 Comments   
Comment by Dianna Hohensee (Inactive) [ 20/May/20 ]

Yes! This ticket should be fixed by SERVER-48144's commit and can now be closed.

Comment by Judah Schvimer [ 20/May/20 ]

dianna.hohensee, is it safe for me to close this ticket as a duplicate of SERVER-48144, or is there more work to do here?

Comment by Dianna Hohensee (Inactive) [ 12/May/20 ]

We will have to add some kind of retryability on stepdown interrupt to JournalFlusher::waitForJournalFlush in SERVER-48149, so that callers can eventually succeed.

Stepdown toggles off updating the oplogTruncateAfterPoint, then makes sure a round of flushing finishes in the JournalFlusher to clear the system before at last unsetting the oplogTruncateAfterPoint. So retrying immediately after that is safe: journal flush without the no longer needed oplogTruncateAfterPoint update that is causing the problems. It's just the race of waitUntilDurable callers not going through the JournalFlusher.

Comment by Judah Schvimer [ 12/May/20 ]

LastVote will still have to wait for stepdown to finish, I think, to get the RSTL lock.

That's fine, waiting a bit is strictly better than voting no when a node doesn't need to.

making lastVote use the JournalFlusher that stepdown interrupts properly.

If it's interrupted, then we'd still have the original problem where nodes vote no when they shouldn't have to, right? Or would that not be interrupted on stepdown?

Comment by Dianna Hohensee (Inactive) [ 12/May/20 ]

I believe so. Without the mutex, lastVote can't hold something that stepdown needs in order to proceed. LastVote will still have to wait for stepdown to finish, I think, to get the RSTL lock.

Lingzhi found a BF where the oplogTruncateAfterPoint is magically set after stepdown clears it, probably by a concurrent thread doing waitUntilDurable during stepdown like lastVote, so I've also filed SERVER-48149 to stop that kind of nonsense. Also a fix for this, now that I think about it, making lastVote use the JournalFlusher that stepdown interrupts properly.

Comment by Judah Schvimer [ 12/May/20 ]

Great! So, to ensure I understand correctly, if SERVER-48144 goes according to plan, this deadlock can be closed as Gone Away?

Comment by Dianna Hohensee (Inactive) [ 12/May/20 ]

Okay, I've filed SERVER-48144 because execution should not be taking a mutex before regular locks – it's bad practice and leads to deadlocks like this one --, regardless of UninterruptibleLockGuards.

SERVER-48144 is quick in theory if all goes well – looking at the code it seems very reasonable --, but will need some performance checks; and it's always possible something could implode when we change something low level in the code.

Comment by Dianna Hohensee (Inactive) [ 12/May/20 ]

Looking into this, writing down the deadlock for ease of communication:

LastVote
    waitUntilDurable
        _lastSyncMutex
            getToken
                refreshIfPrimary
                    AutoGetCollection (stuck on RSTL IX)

Stepdown (RSTL X)
    waitUntilDurable
        _lastSyncMutex (stuck on LastVote's acquisition)

Comment by Judah Schvimer [ 12/May/20 ]

dianna.hohensee, do you have any ideas on how to safely break this cycle while keeping the LastVote waitUntilDurable call uninterruptible? If not, then we should just revert the C++ changes in the patch and make the elections more robust with retries in remove_newly_added_field_after_finishing_initial_sync.js.

Generated at Thu Feb 08 05:16:13 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.