Core Server / SERVER-79810

make JournalFlusher::waitForJournalFlush() interruptible when waiting for write concern

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 7.1.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Labels: None
    • Team: Storage Execution
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Backport: v7.0

      The deadlock is outlined here but I've copy and pasted it for ease:

      1. A thread running setAllowMigrations (which has checked out a session) waits for its metadata changes to be majority committed.
      2. A stepdown thread takes the RSTL lock and tries to check out the session held by thread 1 in order to kill it.
      3. The JournalFlusher thread tries to take the RSTL lock held by thread 2.

      When stepping down, we want caller threads to be able to know that the journal flusher was interrupted. Otherwise, we can get into a deadlock.

      Areas that may need to be addressed:

      1. The journal flusher retries after every interruption, even ErrorCodes::InterruptedDueToReplStateChange. We want to be able to setError when the journal flusher is interrupted during stepdown.
      2. Even if we setError, the flusher still gets stuck in an infinite while loop retrying the journal flush.
      3. The caller we're concerned about for the stepdown deadlock is writeConcern. We may want to waitForJournalFlusher without retrying and could introduce a new method for this. Alternatively, we could pass the writeConcern waiter's opCtx to the JournalFlusher so the wait can be interrupted.

      We should add a test for this deadlock so we can confirm the fix and catch the problem early if it regresses.
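A minimal sketch of the opCtx-passing idea, under stated assumptions: OpCtx here is a hypothetical stand-in for the server's OperationContext, and Status a two-value enum rather than the real class. The waiter checks the caller's interruption flag instead of retrying forever, so stepdown's interruption surfaces to the write concern caller as an error:

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

enum class Status { OK, InterruptedDueToReplStateChange };

struct OpCtx {  // hypothetical stand-in for OperationContext
    std::atomic<bool> interrupted{false};
};

class JournalFlusher {
public:
    // Returns instead of looping forever: an interruption observed while the
    // caller waits is surfaced as an error the caller can act on.
    Status waitForJournalFlush(OpCtx& opCtx) {
        std::unique_lock<std::mutex> lk(_mutex);
        while (!_flushed) {
            if (opCtx.interrupted.load())
                return Status::InterruptedDueToReplStateChange;
            _cv.wait_for(lk, std::chrono::milliseconds(10));
        }
        return Status::OK;
    }

    // Called by the flusher thread once a flush completes.
    void markFlushed() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _flushed = true;
        }
        _cv.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _flushed = false;
};
```

With this shape, the stepdown path only needs to set the waiter's interruption flag; the wait unwinds on its own instead of re-entering the retry loop.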

            Assignee:
            Benety Goh (benety.goh@mongodb.com)
            Reporter:
            Shin Yee Tan (shinyee.tan@mongodb.com)
            Votes:
            0
            Watchers:
            7

              Created:
              Updated:
              Resolved: