[SERVER-70949] Condition variable wait might lead to liveness failure and then crash a node Created: 28/Oct/22  Updated: 02/Nov/22  Resolved: 01/Nov/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 6.2.0-rc0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Marcos José Grillo Ramirez Assignee: Tommaso Tocci
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File BFG-1540486-stacktrace.log    
Issue Links:
Depends
Duplicate
duplicates SERVER-70964 Do not wait for range deletion thread... Closed
Problem/Incident
is caused by SERVER-69886 Properly handle shutdown of range del... Closed
Operating System: ALL
Sprint: Sharding EMEA 2022-11-14
Participants:
Linked BF Score: 135

 Description   

SERVER-69886 added code to prevent a node's shutdown from being delayed when a range deletion task was ongoing. As part of these changes, a new object, ReadyRangeDeletionsProcessor, was added with the sole purpose of processing the enqueued range deletions.

However, the constructor of this object waits on the queue's condition variable (in addition to the internal wait of the processor's thread), and the object gets created in the onStepUpComplete callback. This is a problem because the callback is invoked by the ReplicationCoordinator under the RSTL lock: if there are no range deletion tasks enqueued that would unblock this thread, a future stepDown can never succeed, because it can never acquire the RSTL lock. And since the object is never reset (which would be another way of waking the thread), the process will never serve operations and will eventually crash on the next stepDown. The attached stacktrace shows this situation: in Thread 1 a stepDown thread crashes the node, and in Thread 62 the stepUp thread is holding the RSTL lock.
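
The blocking pattern described above can be illustrated with a minimal standalone sketch. It does not use MongoDB code: `rstl` is a plain `std::timed_mutex` standing in for the RSTL lock, `BlockingProcessor` is a hypothetical simplification of ReadyRangeDeletionsProcessor, and both waits are bounded only so that the demo terminates instead of hanging.

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <future>
#include <mutex>
#include <thread>

using namespace std::chrono_literals;

// Hypothetical stand-in for the RSTL lock (assumption, not the real type).
std::timed_mutex rstl;

// Hypothetical simplification of the processor: the constructor itself
// waits for the queue to become non-empty.
class BlockingProcessor {
public:
    explicit BlockingProcessor(std::chrono::milliseconds demoLimit) {
        std::unique_lock<std::mutex> lk(_mutex);
        // Nothing is ever enqueued here, so this waits for the whole limit.
        _cv.wait_for(lk, demoLimit, [this] { return !_queue.empty(); });
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    std::deque<int> _queue;
};

// Returns true if a "stepDown" could take the lock while the "stepUp"
// callback was still constructing the processor under that lock.
bool stepDownSucceedsDuringStepUp() {
    std::promise<void> lockHeld;
    std::thread stepUp([&lockHeld] {
        std::lock_guard<std::timed_mutex> held(rstl);  // callback runs under the lock
        lockHeld.set_value();
        BlockingProcessor processor(1000ms);           // blocks while holding the lock
    });

    lockHeld.get_future().wait();                  // stepUp now owns the lock
    bool acquired = rstl.try_lock_for(100ms);      // stepDown attempt
    if (acquired) {
        rstl.unlock();
    }
    stepUp.join();
    return acquired;
}
```

In this sketch `stepDownSucceedsDuringStepUp()` returns `false`: the lock stays held for as long as the constructor waits, which mirrors the stepDown failure reported in the stacktrace.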

There shouldn't be a wait in the ReadyRangeDeletionsProcessor's constructor; however, further investigation is needed to ensure that removing it preserves the guarantees the object currently provides.
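
One possible shape for a constructor without a wait is sketched below, under stated assumptions: `NonBlockingProcessor`, `enqueue` and `processedCount` are hypothetical names, tasks are plain integers, and this is not the change that was actually made to the server. The only wait lives in the worker thread, and the destructor wakes that thread so it can exit. The sketch does not address which guarantees of the original object would need to be preserved.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical simplification; not the actual ReadyRangeDeletionsProcessor.
class NonBlockingProcessor {
public:
    // The constructor only starts the worker; it never waits on the queue,
    // so it is safe to call while the caller holds another lock.
    NonBlockingProcessor() : _worker([this] { run(); }) {}

    // Wakes the worker and waits for it to exit.
    ~NonBlockingProcessor() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _shutdown = true;
        }
        _cv.notify_all();
        _worker.join();
    }

    void enqueue(int task) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _queue.push_back(task);
        }
        _cv.notify_one();
    }

    int processedCount() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _processed;
    }

private:
    void run() {
        std::unique_lock<std::mutex> lk(_mutex);
        while (true) {
            // The only wait: inside the worker thread, never in the constructor.
            _cv.wait(lk, [this] { return _shutdown || !_queue.empty(); });
            if (_shutdown) {
                return;  // stop promptly; remaining tasks are left unprocessed
            }
            _queue.pop_front();
            ++_processed;
        }
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    std::deque<int> _queue;
    bool _shutdown = false;
    int _processed = 0;
    std::thread _worker;  // declared last so the other members exist before it starts
};
```

Constructing a `NonBlockingProcessor` with an empty queue returns immediately, and destroying it does not depend on any task ever being enqueued.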

