[SERVER-79648] Investigate why dbcheck deadlock with fsyncLock with blocked insert op Created: 03/Aug/23  Updated: 27/Oct/23  Resolved: 18/Aug/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Moustafa Maher Assignee: Moustafa Maher
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Sprint: Repl 2023-08-21
Participants:

 Description   

The following tests use 'fsyncLock' and then wait for an insert operation to become blocked by checking the in-progress 'currentOp' output, which leads to a deadlock with the dbCheck process.

  • jstests/core/administrative/current_op/currentop.js
  • jstests/core/fsync.js

We need to know why the deadlock is happening.



 Comments   
Comment by Moustafa Maher [ 18/Aug/23 ]

The deadlock analysis:
The test goes through these 3 steps:
1- fsyncLock
2- Do an insert and wait for currentOp to show that the insert is waiting for the lock (the flusher).  (<< Times out waiting for the curOp when dbcheck is on >>)
3- fsyncUnlock and let the insert succeed.

While fsyncLock is on, multiple threads take write tickets and then block on the flusher. After that, no other thread can take a ticket, with either PriorityTicketHolder or SemaphoreTicketHolder, because no tickets are left: our tests set the limit wiredTigerConcurrentWriteTransactions: 30. dbcheck takes write tickets while trying to write its oplog entries and then blocks on the flusher. When the insert from the test arrives, it finds no write ticket available and waits for one. It never even reaches AcquireCollection, so the curOp does not get logged.
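
The ticket exhaustion described above can be illustrated with a small toy model. This is only a sketch in Python, not MongoDB code: the class TicketModel, the function insert_visible_in_current_op, and the timeouts are hypothetical names and values chosen for illustration. The only value taken from the analysis is the limit of 30 write tickets. Background writers stand in for the threads (such as dbcheck) that take a ticket and then block on the flusher.

```python
import threading
import time

# Mirrors wiredTigerConcurrentWriteTransactions: 30 from the analysis above.
WRITE_TICKETS = 30


class TicketModel:
    """Toy model: a write needs a ticket first, then blocks until fsyncUnlock."""

    def __init__(self, tickets):
        self.tickets = threading.Semaphore(tickets)
        self.fsync_unlocked = threading.Event()  # set() plays the role of fsyncUnlock
        self.guard = threading.Lock()
        self.waiting_for_lock = set()  # ops a currentOp-style check would see

    def write_op(self, name, ticket_timeout=None):
        # Step 1: take a write ticket. With no tickets left, the op waits here
        # and never becomes visible as "waiting for lock".
        if not self.tickets.acquire(timeout=ticket_timeout):
            return False
        # Step 2: the op is now visible and blocks on the flusher.
        with self.guard:
            self.waiting_for_lock.add(name)
        self.fsync_unlocked.wait()
        with self.guard:
            self.waiting_for_lock.discard(name)
        self.tickets.release()
        return True


def insert_visible_in_current_op(background_writers, tickets=WRITE_TICKETS):
    """Return True if the test's insert shows up as waiting for the lock."""
    model = TicketModel(tickets)
    writers = [
        threading.Thread(target=model.write_op, args=(f"bg-{i}",))
        for i in range(background_writers)
    ]
    for w in writers:
        w.start()

    # Wait until the background writers hold as many tickets as they can.
    expected = min(background_writers, tickets)
    while True:
        with model.guard:
            if len(model.waiting_for_lock) >= expected:
                break
        time.sleep(0.001)

    # The test's insert: gives up on the ticket after a short timeout.
    insert = threading.Thread(
        target=model.write_op, args=("insert",), kwargs={"ticket_timeout": 0.2}
    )
    insert.start()
    time.sleep(0.3)  # stand-in for the test polling currentOp
    with model.guard:
        visible = "insert" in model.waiting_for_lock

    model.fsync_unlocked.set()  # fsyncUnlock: everything drains
    insert.join()
    for w in writers:
        w.join()
    return visible


if __name__ == "__main__":
    print(insert_visible_in_current_op(29))  # one ticket left for the insert
    print(insert_visible_in_current_op(30))  # all tickets held by blocked writers
```

In this model the insert is visible only while at least one ticket is still free. Once all 30 tickets are held by writers blocked on the flusher, the insert waits for a ticket instead and never appears in the set the check inspects, which corresponds to the test timing out in step 2.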
 
