[SERVER-70888] ScopedRangeDeleterLock might lead to a deadlock on stepdown Created: 27/Oct/22  Updated: 29/Oct/23  Resolved: 28/Oct/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 6.2.0-rc0
Fix Version/s: 6.2.0-rc0

Type: Bug Priority: Major - P3
Reporter: Marcos José Grillo Ramirez Assignee: Tommaso Tocci
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File BFG-1553238-stacktrace.log    
Issue Links:
Depends
Problem/Incident
is caused by SERVER-70094 Synchronize shutdown with resuming of... Closed
Related
related to SERVER-60161 Deadlock between config server stepdo... Closed
related to SERVER-70864 Get rid of fine grained scoped range ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding EMEA 2022-10-31
Participants:
Linked BF Score: 153

 Description   

SERVER-70094 added code to synchronize range deletion with stepdowns. Specifically, it stores the executor of the range deletion thread so that the executor can be joined when the service is stopped.

This has an unintended consequence, though. If a stepdown command manages to grab the RSTL (Replication State Transition Lock) before the RangeDeleterService thread does, the stepdown gets stuck when trying to stop the service, because it waits for the range deleter service executor. At the same time, the range deleter service thread is waiting for the RSTL.

So we have a thread that holds the RSTL while waiting for an executor, and that executor can only finish after its thread acquires the RSTL: a deadlock.

To solve this, besides the executor, we could also capture the range deleter's operation context and cancel it before waiting for the executor. The thread blocked on the RSTL would then be interrupted, and the executor could be joined.
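The cycle and the proposed fix can be modeled with a small Python sketch. This is not MongoDB's code, and the actual commit may differ. Every name here (`simulate_stepdown`, `rstl`, `cancelled`, `range_deleter_task`) is hypothetical: a plain lock stands in for the RSTL, an event stands in for interrupting the operation context, and a thread stands in for the range deleter executor. The worker acquires the lock interruptibly by polling it with a short timeout and giving up once cancelled. With `cancel_before_join=False`, the stepdown side's join times out, which is the deadlock described above bounded by a timeout. With `cancel_before_join=True`, the worker bails out and the join succeeds.

```python
import threading


def simulate_stepdown(cancel_before_join: bool, join_timeout: float = 1.0) -> bool:
    """Hypothetical model: return True if 'stepdown' managed to join the worker."""
    rstl = threading.Lock()          # stands in for the RSTL
    cancelled = threading.Event()    # stands in for interrupting the opCtx
    stepdown_has_rstl = threading.Event()

    def range_deleter_task() -> None:
        # Start only after stepdown holds the RSTL, reproducing the bad interleaving.
        stepdown_has_rstl.wait()
        # Interruptible acquisition: poll the lock, stop trying once cancelled.
        while not cancelled.is_set():
            if rstl.acquire(timeout=0.05):
                try:
                    pass  # range deletion work would happen here
                finally:
                    rstl.release()
                return

    worker = threading.Thread(target=range_deleter_task, daemon=True)
    worker.start()

    with rstl:  # stepdown grabs the RSTL first
        stepdown_has_rstl.set()
        if cancel_before_join:
            cancelled.set()  # the fix: interrupt the worker before joining it
        worker.join(timeout=join_timeout)  # "stop the service": join the executor
        joined = not worker.is_alive()

    # Cleanup for the demo only: let a still-blocked worker exit.
    cancelled.set()
    worker.join()
    return joined
```

For example, `simulate_stepdown(cancel_before_join=False, join_timeout=0.3)` returns `False`, while `simulate_stepdown(cancel_before_join=True)` returns `True`. The timeout on `join` only exists so the model terminates; in the bug as reported, the wait is unbounded.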



 Comments   
Comment by Githook User [ 28/Oct/22 ]

Author: Tommaso Tocci (toto-dev) <tommaso.tocci@mongodb.com>

Message: SERVER-70888 ScopedRangeDeleterLock might lead to a deadlock on stepdown
Branch: master
https://github.com/mongodb/mongo/commit/64488b038d94743efebd703ffe3fb0003468e686

Comment by Pierlauro Sciarelli [ 27/Oct/22 ]

Yet another bug that will be solved by SERVER-70864. Thanks for reporting it!

Generated at Thu Feb 08 06:17:24 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.