[SERVER-70873] Stepdown during drop collection can lead to a deadlock Created: 26/Oct/22  Updated: 29/Oct/23  Resolved: 04/Nov/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 6.2.0-rc0
Fix Version/s: 6.2.0-rc0

Type: Bug Priority: Major - P3
Reporter: Marcos José Grillo Ramirez Assignee: Marcos José Grillo Ramirez
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File BFG-1553779.log    
Issue Links:
Depends
Problem/Incident
is caused by SERVER-65016 Remove range deletions as part of `dr... Closed
Related
related to SERVER-60161 Deadlock between config server stepdo... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding EMEA 2022-10-31, Sharding EMEA 2022-11-14
Participants:
Linked BF Score: 135

 Description   

As part of SERVER-65016, new code to remove the range deletion documents was added as an optimization to the existing drop collection code. However, this code uses an alternative client region in order to remove multiple documents. This is done because shardsvr_drop_collection_participant uses the retryable write machinery to guard against replays.

The unintended effect is that a thread dropping a collection will first check out a session and then, as part of taking the collection lock, try to acquire the RSTL when executing the DBClient command that removes the range deletion documents. If a stepdown sneaks in after the session is checked out, the stepdown thread will acquire the RSTL and then try to check out and kill all running sessions. The drop thread holds the session while waiting for the RSTL, and the stepdown thread holds the RSTL while waiting for the session, causing a deadlock.

In the attached stacktrace log this situation can be seen between Thread 2 and Thread 99. One way to solve this is to create the operation context the same way the rename collection metadata command does, that is, by linking the new operation context created in the alternative client region to the parent cancellation token. This way, when the parent operation context is interrupted during the stepdown, the thread waiting for the lock will finish and release the session, allowing the stepdown thread to check it out.
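
The lock ordering and the proposed fix can be illustrated with a small, self-contained Python sketch. It is not the server code: `session`, `rstl`, and `parent_cancel` are hypothetical stand-ins for the checked-out session, the RSTL, and the parent cancellation token, and the polling helper `acquire_cancelable` is only an assumption about how a cancelable lock wait might be modeled.

```python
import threading


class Interrupted(Exception):
    """Raised when a lock wait is abandoned because the operation was cancelled."""


def acquire_cancelable(lock, cancel, poll=0.01):
    # Poll the lock so the wait can be abandoned once `cancel` is set.
    while not lock.acquire(timeout=poll):
        if cancel.is_set():
            raise Interrupted()


def run_scenario():
    session = threading.Lock()           # hypothetical stand-in for the session
    rstl = threading.Lock()              # hypothetical stand-in for the RSTL
    parent_cancel = threading.Event()    # stand-in for the parent cancellation token
    session_checked_out = threading.Event()
    rstl_taken = threading.Event()
    events = []

    def drop_collection():
        with session:                    # 1. check out the session
            session_checked_out.set()
            rstl_taken.wait(timeout=5)   # let the stepdown sneak in here
            try:
                # 2. wait for the RSTL, but stay linked to the parent token
                acquire_cancelable(rstl, parent_cancel)
            except Interrupted:
                events.append("drop: interrupted, releasing session")
                return                   # leaving the block releases the session
            rstl.release()
            events.append("drop: completed")

    def stepdown():
        session_checked_out.wait(timeout=5)
        with rstl:                       # 1. acquire the RSTL
            rstl_taken.set()
            parent_cancel.set()          # 2. interrupt the parent operation
            # 3. check out the session; the timeout only guards this demo
            if session.acquire(timeout=5):
                events.append("stepdown: checked out session")
                session.release()
            else:
                events.append("stepdown: timed out (deadlock)")

    threads = [
        threading.Thread(target=drop_collection),
        threading.Thread(target=stepdown),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    return events
```

Calling `run_scenario()` returns the two events in order: the drop thread is interrupted and releases the session, then the stepdown thread checks it out. If the wait in `drop_collection` ignored `parent_cancel`, the stepdown thread in this sketch would only get out through its demo timeout, which mirrors the deadlock described above.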



 Comments   
Comment by Githook User [ 04/Nov/22 ]

Author: Marcos José Grillo Ramirez (m4nti5) <marcos.grillo@mongodb.com>

Message: SERVER-70873 Use cancelable OperationContext in ACR to prevent deadlock with stepdowns
Branch: master
https://github.com/mongodb/mongo/commit/f06a898557f951bf48569012c4372e770d1c0267
