[SERVER-79810] make JournalFlusher::waitForJournalFlush() interruptible when waiting for write concern Created: 07/Aug/23  Updated: 01/Feb/24  Resolved: 10/Aug/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Shin Yee Tan Assignee: Benety Goh
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
Related
related to SERVER-79919 write js test for SERVER-79810 Closed
is related to SERVER-48149 Move callers of waitUntilDurable onto... Closed
is related to SERVER-55745 The Fuzzer can run killOp on the Jour... Closed
is related to SERVER-57229 killOp_against_journal_flusher_thread... Closed
is related to SERVER-79026 Failing to cancel the JournalFlusher ... Closed
is related to SERVER-61484 Allow ExceededMemoryLimit to be a ben... Closed
is related to SERVER-79174 Improve journal flusher interruption ... Closed
Assigned Teams:
Storage Execution
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v7.0
Steps To Reproduce:

The deadlock is outlined here, but I've copied and pasted it for ease:

  1. A thread with setAllowMigrations (which checked out a session) waits for the changes to the metadata to be majority committed
  2. A stepdown thread takes the RSTL lock and tries to check out the session from step 1 in order to kill it
  3. Another thread with the JournalFlusher tries to take the RSTL lock held by the thread in step 2
Sprint: Execution NAMR Team 2023-08-21
Participants:
Linked BF Score: 120

 Description   

When stepping down, we want caller threads to be able to know that the journal flusher was interrupted. Otherwise, we can get into a deadlock.

Areas that may need to be addressed:

  1. All interruptions are retried, even ErrorCodes::InterruptedDueToReplStateChange. We want to be able to setError when the journal flusher is interrupted during stepdown.
  2. Even if we setError, we still get stuck in this infinite while loop retrying to flush the journal.
  3. The caller we're concerned about for the stepdown deadlock is writeConcern. We may want to wait for the journal flusher without retrying and could introduce a new method for this. Or we may want to pass the writeConcern's opCtx to the journalFlusher so it can be interrupted.
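
To make points 1 and 2 concrete, here is a minimal, self-contained sketch contrasting a wait that retries every interruption with one that hands a stepdown interruption back to the caller. It is not the server's code: FlushStatus, WaitResult, waitRetryingAll, waitSurfacingStepdown and the maxAttempts bound are all hypothetical names introduced for illustration only.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-ins for server status codes; not the real ErrorCodes enum.
enum class FlushStatus { kOK, kInterrupted, kInterruptedDueToReplStateChange };

struct WaitResult {
    FlushStatus status;
    int attempts;
};

// Retries after every interruption, including stepdown. maxAttempts exists only
// so this sketch terminates; the loop described in the ticket has no bound.
WaitResult waitRetryingAll(const std::function<FlushStatus()>& tryFlush, int maxAttempts) {
    WaitResult result{FlushStatus::kInterrupted, 0};
    while (result.attempts < maxAttempts) {
        result.status = tryFlush();
        ++result.attempts;
        if (result.status == FlushStatus::kOK) {
            break;
        }
    }
    return result;
}

// Retries ordinary interruptions but returns a stepdown interruption to the caller.
WaitResult waitSurfacingStepdown(const std::function<FlushStatus()>& tryFlush, int maxAttempts) {
    WaitResult result{FlushStatus::kInterrupted, 0};
    while (result.attempts < maxAttempts) {
        result.status = tryFlush();
        ++result.attempts;
        if (result.status != FlushStatus::kInterrupted) {
            break;  // success or stepdown: stop retrying
        }
    }
    return result;
}

// Example: a flush attempt that is always interrupted by stepdown.
void exampleStepdown() {
    auto tryFlush = [] { return FlushStatus::kInterruptedDueToReplStateChange; };
    assert(waitRetryingAll(tryFlush, 5).attempts == 5);        // keeps retrying
    assert(waitSurfacingStepdown(tryFlush, 5).attempts == 1);  // stops at once
}
```

In the sketch, only the second variant lets the waiting caller learn about the stepdown, which is the property the description asks for.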

We should add a test for this deadlock so we can confirm the fix and catch the deadlock early if it happens again.



 Comments   
Comment by Githook User [ 19/Dec/23 ]

Author:

{'name': 'Suganthi Mani', 'email': 'suganthi.mani@mongodb.com', 'username': 'smani87'}

Message: Revert "SERVER-79810 remove unnecessary loop around JournalFlusher::_waitForJournalFlushNoRetry()"

This reverts commit 8b8982cee48b0e3714a2fe19faeea78ec4dce409.

GitOrigin-RevId: d5e4ea2a8e73749fa392eb33ac2ede666e815425
Branch: v7.0
https://github.com/mongodb/mongo/commit/f37b15c94b3ea4303a4e6f3d937cf97bb71bb913

Comment by Githook User [ 19/Dec/23 ]

Author:

{'name': 'Suganthi Mani', 'email': 'suganthi.mani@mongodb.com', 'username': 'smani87'}

Message: Revert "SERVER-79810 JournalFlusher::waitForJournalFlush() accepts Interruptible"

This reverts commit dc048a80c300f108e20018283edd2ad01854cbc1.

GitOrigin-RevId: 0abf7713d8e0a5d628d6880263688d9e073ebcef
Branch: v7.0
https://github.com/mongodb/mongo/commit/d3fa2449958bd581d57c02de07b7c61fbe2560c2

Comment by Githook User [ 19/Dec/23 ]

Author:

{'name': 'Suganthi Mani', 'email': 'suganthi.mani@mongodb.com', 'username': 'smani87'}

Message: Revert "SERVER-79810 make JournalFlusher::waitForJournalFlush() interruptible when waiting for write concern"

This reverts commit 812b333ab2e6eb10c3194a51b27c1846fa5272e9.

GitOrigin-RevId: 36190d3d80f0e06935ff74fcde7f629d4c6f2b83
Branch: v7.0
https://github.com/mongodb/mongo/commit/cb622bfd89020cfbd76b160fdb4e2198df3de3c9

Comment by Githook User [ 13/Dec/23 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-79810 make JournalFlusher::waitForJournalFlush() interruptible when waiting for write concern

(cherry picked from commit a0c9b5ca2b4cd85677b1ceecb2f2bb68d6b92322)

GitOrigin-RevId: 812b333ab2e6eb10c3194a51b27c1846fa5272e9
Branch: v7.0
https://github.com/mongodb/mongo/commit/b67c741f214de816f61ed190a58d4d9f6c936403

Comment by Githook User [ 13/Dec/23 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-79810 JournalFlusher::waitForJournalFlush() accepts Interruptible

The default behavior is for this operation to be uninterruptible. Callers
have to pass in an OperationContext to make this operation interruptible.

(cherry picked from commit d9b8636c499426bde024f3ebc04fdcc78349ee05)

GitOrigin-RevId: dc048a80c300f108e20018283edd2ad01854cbc1
Branch: v7.0
https://github.com/mongodb/mongo/commit/85953b0033b49c3774f2923244ce5ed45c559f90

Comment by Githook User [ 13/Dec/23 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-79810 remove unnecessary loop around JournalFlusher::_waitForJournalFlushNoRetry()

The JournalFlusher background thread absorbs interruption errors and will
never propagate interruptions to _waitForJournalFlushNoRetry().

(cherry picked from commit 56fc93b5c302acf3b5db78477e1533dc2ef08cfe)

GitOrigin-RevId: 8b8982cee48b0e3714a2fe19faeea78ec4dce409
Branch: v7.0
https://github.com/mongodb/mongo/commit/5245a1e914b2b863b11f331cdbca7ecdd85bc901

Comment by Benety Goh [ 11/Aug/23 ]

Changing title because the merged changes do not quite match the current ticket description/title.

The background thread in the JournalFlusher will continue to restart its loop on receiving any interruptions. It will not forward these interruptions to callers waiting in JournalFlusher::waitForJournalFlush().

Threads blocked on JournalFlusher::waitForJournalFlush() may now provide an Interruptible instance (OperationContext for example) that may be used to wake the thread up and unblock the waitForJournalFlush(). Currently, the only scenario where we opt into this behavior is in waitForWriteConcern().
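
The opt-in behavior described above can be sketched roughly as follows. This is a simplified model rather than the actual JournalFlusher: FlushWaiter, its methods, and the use of a caller-owned std::atomic<bool> in place of an Interruptible or OperationContext are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical model of a journal-flush waiter; not MongoDB's JournalFlusher.
class FlushWaiter {
public:
    enum class Result { kFlushed, kInterrupted };

    // Blocks until markFlushed() is called. If interruptFlag is null (the
    // default), the wait cannot be interrupted. Otherwise it also returns once
    // the flag has been set through interrupt().
    Result waitForFlush(const std::atomic<bool>* interruptFlag = nullptr) {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [&] {
            return _flushed || (interruptFlag != nullptr && interruptFlag->load());
        });
        return _flushed ? Result::kFlushed : Result::kInterrupted;
    }

    // Records that the flush finished and wakes every waiter.
    void markFlushed() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _flushed = true;
        }
        _cv.notify_all();
    }

    // Sets the caller-owned flag under the mutex, then wakes all waiters so an
    // opted-in waiter can observe it. Waiters without a flag keep waiting.
    void interrupt(std::atomic<bool>& flag) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            flag.store(true);
        }
        _cv.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _flushed = false;
};

// Example: an operation that was killed (for instance by stepdown) returns
// from the wait even though no flush has completed.
void exampleInterruptibleWait() {
    FlushWaiter waiter;
    std::atomic<bool> killed{false};
    waiter.interrupt(killed);
    assert(waiter.waitForFlush(&killed) == FlushWaiter::Result::kInterrupted);
}
```

Keeping the null default means existing callers stay uninterruptible, and only a caller that passes its own flag, as waitForWriteConcern() does with its OperationContext in the real change, takes on the new behavior.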

OLD TITLE: Allow journal flusher to signal interruptions to callers
NEW TITLE: make JournalFlusher::waitForJournalFlush() interruptible when waiting for write concern

Comment by Githook User [ 10/Aug/23 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-79810 make JournalFlusher::waitForJournalFlush() interruptible when waiting for write concern
Branch: master
https://github.com/mongodb/mongo/commit/a0c9b5ca2b4cd85677b1ceecb2f2bb68d6b92322

Comment by Githook User [ 09/Aug/23 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-79810 JournalFlusher::waitForJournalFlush() accepts Interruptible

The default behavior is for this operation to be uninterruptible. Callers
have to pass in an OperationContext to make this operation interruptible.
Branch: master
https://github.com/mongodb/mongo/commit/d9b8636c499426bde024f3ebc04fdcc78349ee05

Comment by Githook User [ 09/Aug/23 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-79810 remove unnecessary loop around JournalFlusher::_waitForJournalFlushNoRetry()

The JournalFlusher background thread absorbs interruption errors and will
never propagate interruptions to _waitForJournalFlushNoRetry().
Branch: master
https://github.com/mongodb/mongo/commit/56fc93b5c302acf3b5db78477e1533dc2ef08cfe

Generated at Thu Feb 08 06:41:58 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.