[SERVER-79026] Failing to cancel the JournalFlusher thread might lead to 3-way deadlock Created: 17/Jul/23  Updated: 29/Oct/23  Resolved: 18/Jul/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 7.1.0-rc0
Fix Version/s: 7.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Marcos José Grillo Ramirez Assignee: Gregory Wlodarek
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File gdb.BFG-2016684.c_n1.txt     Text File gdb_s0_n2.txt    
Issue Links:
Depends
Related
related to SERVER-79810 make JournalFlusher::waitForJournalFl... Closed
related to SERVER-79174 Improve journal flusher interruption ... Closed
is related to SERVER-55745 The Fuzzer can run killOp on the Jour... Closed
is related to SERVER-73539 stopMigrations/resumeMigrations don't... Closed
is related to SERVER-78021 Retrying setAllowMigrations command m... Closed
is related to SERVER-74657 revisit if thread marked as unkillabl... Open
is related to SERVER-70127 Default system operations to be killa... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Execution NAMR Team 2023-07-24
Participants:
Linked BF Score: 140

 Description   

SERVER-73539 introduced replay protection for setAllowMigrations. As part of those changes (and the subsequent fix, SERVER-78021), we create an AlternativeClientRegion in which a transaction with majority write concern is performed. Currently the JournalFlusher is an unkillable thread that tries to take the RSTL lock while waiting for all commits issued before the call to become durable in the journal. As a result, in the presence of a stepdown, the following scenario might happen on the config server:

  1. A thread with setAllowMigrations (which checked out a session) waits for the changes to the metadata to be majority committed
  2. A stepdown thread takes the RSTL lock and tries to check out the session of 1. in order to kill it
  3. The JournalFlusher thread tries to take the RSTL lock held by 2.

After 3 we have one thread (1) waiting for majority, but the thread that makes the changes durable (3) is waiting for the RSTL lock held by the stepdown thread (2), which in turn is waiting for the session checked out by (1) to be checked in. This causes a 3-way deadlock. Attached to the ticket are 2 stack traces showing the problem described above.
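The cycle described above can be pictured as a small wait-for graph. The following is only an illustrative sketch: the node names and the `find_cycle` helper are hypothetical labels chosen for this example, not MongoDB identifiers, and the edge from setAllowMigrations to the JournalFlusher is a simplification of "majority commit cannot advance until the write is durable".

```python
# Hypothetical model of the wait-for relationships described in this ticket.
# Each key waits for the thread named in its value.
WAITS_FOR = {
    "setAllowMigrations": "JournalFlusher",  # (1) needs the write made durable
    "JournalFlusher": "stepdown",            # (3) needs the RSTL held by (2)
    "stepdown": "setAllowMigrations",        # (2) needs the session held by (1)
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from `start`; return the cycle as a list, or None."""
    path = []
    seen = set()
    node = start
    while node is not None and node not in seen:
        seen.add(node)
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        # The chain ended at a thread that is not waiting on anyone.
        return None
    # `node` was visited before, so everything from its first visit is the cycle.
    return path[path.index(node):]


if __name__ == "__main__":
    print(find_cycle(WAITS_FOR, "setAllowMigrations"))
```

Removing any one edge (for example, letting the JournalFlusher give up its wait) leaves no cycle, which is the intuition behind the proposed fix.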

One way to solve this is to make the JournalFlusher thread killable as well, like the main operation (in this case the setAllowMigrations thread).
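As a rough illustration of why a killable waiter breaks the deadlock, the sketch below uses plain Python threading rather than the server's actual lock manager. Everything here (`Interrupted`, `acquire_interruptibly`, `demo`, and the polling approach) is an assumption made for the example and does not reflect how the real fix is implemented.

```python
import threading


class Interrupted(Exception):
    """Raised when a lock wait is abandoned because the waiter was killed."""


def acquire_interruptibly(lock, kill_event, poll_interval=0.01):
    # Retry the acquisition in short slices so the kill flag is checked
    # regularly instead of blocking on the lock forever.
    while not lock.acquire(timeout=poll_interval):
        if kill_event.is_set():
            raise Interrupted("lock wait interrupted")


def demo():
    rstl = threading.Lock()    # stands in for the RSTL
    kill = threading.Event()   # stands in for the stepdown killing the waiter
    rstl.acquire()             # the "stepdown thread" holds the lock
    result = []

    def flusher():
        try:
            acquire_interruptibly(rstl, kill)
            result.append("acquired")
            rstl.release()
        except Interrupted:
            result.append("interrupted")

    worker = threading.Thread(target=flusher)
    worker.start()
    kill.set()                 # interrupt the waiter instead of waiting on it
    worker.join(timeout=5)
    rstl.release()
    return result


if __name__ == "__main__":
    print(demo())
```

Because the waiter gives up instead of blocking indefinitely, the lock holder is no longer part of a cycle and can make progress.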



 Comments   
Comment by Githook User [ 18/Jul/23 ]

Author: Gregory Wlodarek <gregory.wlodarek@mongodb.com> (GWlodarek)

Message: SERVER-79026 Make the JournalFlusher killable
Branch: master
https://github.com/mongodb/mongo/commit/7dcc87d19f79526ced914a07688b7afe9ec545f8

Comment by Benety Goh [ 18/Jul/23 ]

We added a JS test for the killOp behavior in SERVER-55745. The log message expected for a stepdown-interrupted Journal Flusher thread will have the same message ID as the one for the killOp behavior.

Comment by Benety Goh [ 18/Jul/23 ]

We added the stepdown exclusion in SERVER-70127.

Generated at Thu Feb 08 06:39:53 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.