[SERVER-48125] Stepdown can deadlock with storing lastVote via journal flusher Created: 12/May/20 Updated: 06/Dec/22 Resolved: 20/May/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Vesselina Ratcheva (Inactive) | Assignee: | Backlog - Replication Team |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||
| Assigned Teams: | Replication |
| Operating System: | ALL | ||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||
| Linked BF Score: | 25 | ||||||||||||||||||||||||
| Description |
|
As part of storing the lastVote document, we wait for it to be durable, which eventually calls into refreshOplogTruncateAfterPointIfPrimary. That function needs to acquire a global IX lock as part of an AutoGetCollection. This can deadlock with stepdown, which tries to clear the oplog truncate after point and in turn waits on a journal flush. The journal flusher needs to run its own wait for durability too, but it cannot enter the critical section because that section is protected by a mutex already held by the lastVote thread. We recently made storing the lastVote document fully uninterruptible in |
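The cycle described above can be written down as a small wait-for graph. This is only an illustrative sketch in Python, not server code: the node labels are informal names for the three parties, and the edge from the lastVote writer to stepdown assumes that stepdown is what blocks the global IX lock acquisition.

```python
from typing import Dict, List, Optional

# Wait-for graph: each waiter maps to the party holding what it needs.
# Labels are informal names for this sketch, not server identifiers.
WAITS_FOR: Dict[str, str] = {
    # Needs the global IX lock (assumed here to be blocked by stepdown).
    "lastVote writer": "stepdown",
    # Needs a journal flush to finish before clearing the truncate point.
    "stepdown": "journal flusher",
    # Needs the mutex guarding the wait-for-durability critical section.
    "journal flusher": "lastVote writer",
}


def find_cycle(waits_for: Dict[str, str], start: str) -> Optional[List[str]]:
    """Follow wait-for edges from start; return the cycle if one is reached."""
    path: List[str] = []
    seen = set()
    node: Optional[str] = start
    while node is not None and node not in seen:
        seen.add(node)
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None  # the chain ended at a party that is not waiting
    return path[path.index(node):] + [node]


cycle = find_cycle(WAITS_FOR, "lastVote writer")
print(" -> ".join(cycle) if cycle else "no cycle")
```

Removing the `"journal flusher"` entry, which models the flusher no longer needing the mutex held by the lastVote thread, makes `find_cycle` return `None`.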
| Comments |
| Comment by Dianna Hohensee (Inactive) [ 20/May/20 ] | |||||||||
|
Yes! This ticket should be fixed by | |||||||||
| Comment by Judah Schvimer [ 20/May/20 ] | |||||||||
|
dianna.hohensee, is it safe for me to close this ticket as a duplicate of | |||||||||
| Comment by Dianna Hohensee (Inactive) [ 12/May/20 ] | |||||||||
|
We will have to add some kind of retryability on stepdown interrupt to JournalFlusher::waitForJournalFlush in Stepdown toggles off updating the oplogTruncateAfterPoint, then makes sure a round of flushing finishes in the JournalFlusher to clear the system before at last unsetting the oplogTruncateAfterPoint. So retrying immediately after that is safe: the retried journal flush runs without the no-longer-needed oplogTruncateAfterPoint update that is causing the problems. What remains is just the race with waitUntilDurable callers that do not go through the JournalFlusher. | |||||||||
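A minimal sketch of what "retryability on stepdown interrupt" could look like, written in Python for illustration only. `InterruptedDueToStepdown`, `wait_for_journal_flush_with_retry`, and `make_flaky_wait` are hypothetical names invented for this sketch; the real JournalFlusher::waitForJournalFlush is C++ and its eventual shape is not shown in this ticket.

```python
from typing import Callable


class InterruptedDueToStepdown(Exception):
    """Hypothetical stand-in for a flush wait interrupted by stepdown."""


def wait_for_journal_flush_with_retry(
    wait_once: Callable[[], None], max_attempts: int = 2
) -> int:
    """Call wait_once, retrying if a stepdown interrupts it.

    Returns the number of attempts used; re-raises the interrupt once
    max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            wait_once()
            return attempt
        except InterruptedDueToStepdown:
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")


def make_flaky_wait(interruptions: int) -> Callable[[], None]:
    """Build a fake flush wait that is interrupted a fixed number of times."""
    remaining = [interruptions]

    def wait_once() -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise InterruptedDueToStepdown()

    return wait_once


# One stepdown interrupt, then the retried flush succeeds.
attempts = wait_for_journal_flush_with_retry(make_flaky_wait(1))
print(attempts)
```

The bound on attempts is a choice made for this sketch; the comment above only argues that a single immediate retry after the stepdown interrupt is safe.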
| Comment by Judah Schvimer [ 12/May/20 ] | |||||||||
That's fine, waiting a bit is strictly better than voting no when a node doesn't need to.
If it's interrupted, then we'd still have the original problem where nodes vote no when they shouldn't have to, right? Or would that not be interrupted on stepdown? | |||||||||
| Comment by Dianna Hohensee (Inactive) [ 12/May/20 ] | |||||||||
|
I believe so. Without the mutex, lastVote can't hold something that stepdown needs in order to proceed. LastVote will still have to wait for stepdown to finish, I think, to get the RSTL lock. Lingzhi found a BF where the oplogTruncateAfterPoint is magically set after stepdown clears it, probably by a concurrent thread doing waitUntilDurable during stepdown like lastVote, so I've also filed | |||||||||
| Comment by Judah Schvimer [ 12/May/20 ] | |||||||||
|
Great! So, to ensure I understand correctly, if | |||||||||
| Comment by Dianna Hohensee (Inactive) [ 12/May/20 ] | |||||||||
|
Okay, I've filed
| |||||||||
| Comment by Dianna Hohensee (Inactive) [ 12/May/20 ] | |||||||||
|
Looking into this, writing down the deadlock for ease of communication:
| |||||||||
| Comment by Judah Schvimer [ 12/May/20 ] | |||||||||
|
dianna.hohensee, do you have any ideas on how to safely break this cycle while keeping the LastVote waitUntilDurable call uninterruptible? If not, then we should just revert the C++ changes in the patch and make the elections more robust with retries in remove_newly_added_field_after_finishing_initial_sync.js. |