[SERVER-46704] Two phase index build can violate locking ordering and can lead to deadlocks. Created: 09/Mar/20  Updated: 29/Oct/23  Resolved: 19/Mar/20

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 4.4.0-rc0, 4.7.0

Type: Bug Priority: Major - P3
Reporter: Suganthi Mani Assignee: Suganthi Mani
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Duplicate
duplicates SERVER-46989 Index builds should hold RSTL to prev... Closed
Related
related to SERVER-42621 3 way deadlock can happen between hyb... Closed
related to SERVER-44722 3 way deadlock can happen between hyb... Closed
related to SERVER-46664 runCmdOnPrimaryAndAwaitResponse() sho... Closed
related to SERVER-44045 allow secondary index builds to start... Closed
is related to SERVER-46910 2 phase index builds should not try t... Closed
is related to SERVER-46917 Index builder on receiving commit/abo... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.4
Sprint: Execution Team 2020-03-09, Execution Team 2020-03-23
Participants:
Linked BF Score: 49

 Description   

Currently, IndexBuildsCoordinatorMongod::voteCommitIndexBuild() violates the lock ordering: it tries to acquire the RSTL in mode IX while holding ReplIndexBuildState::mutex. As a result, it can deadlock with the step-up code path (ReplicationCoordinatorImpl::signalDrainComplete), which acquires the RSTL in mode X first and then, while still holding it, tries to acquire ReplIndexBuildState::mutex to send a commit or abort signal to the index build.
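The wait-for cycle above can be reproduced with a small thread model. This is a minimal sketch, not server code: it assumes both locks can be modelled as plain exclusive locks (IX and X on the RSTL conflict, so the cycle is the same), treats a timed-out acquire as "deadlocked", and uses hypothetical functions `step_up`, `vote_buggy`, `vote_fixed` and `run_pair`. The fixed variant follows the direction of the commit below: finish with the mutex before requesting the RSTL.

```python
import threading

# Simplified model (assumption): both locks are plain exclusive locks. The real
# RSTL is multi-mode, but IX and X conflict, so the wait-for cycle is the same.
TIMEOUT = 1.0  # seconds; a timed-out acquire stands in for "deadlocked"


def try_lock(lock):
    """Briefly take `lock`; return False if it could not be had in TIMEOUT."""
    if lock.acquire(timeout=TIMEOUT):
        lock.release()
        return True
    return False


def step_up(rstl, state_mutex, barrier):
    # Models signalDrainComplete: RSTL (mode X) first, then the build's mutex.
    with rstl:
        barrier.wait()
        return try_lock(state_mutex)


def vote_buggy(rstl, state_mutex, barrier):
    # Models the old voteCommitIndexBuild: RSTL requested with the mutex held.
    with state_mutex:
        barrier.wait()
        return try_lock(rstl)


def vote_fixed(rstl, state_mutex, barrier):
    # Touch in-memory state under the mutex, release it, THEN take the RSTL,
    # so every thread honours the order RSTL -> ReplIndexBuildState::mutex.
    with state_mutex:
        pass
    barrier.wait()
    return try_lock(rstl)


def run_pair(vote_fn):
    """Run a vote path against the step-up path with overlapping requests.
    Returns True only if both threads obtained every lock they asked for."""
    rstl, state_mutex = threading.Lock(), threading.Lock()
    barrier = threading.Barrier(2)  # lines the two paths up so they overlap
    results = []

    def run(fn):
        results.append(fn(rstl, state_mutex, barrier))

    threads = [threading.Thread(target=run, args=(fn,)) for fn in (vote_fn, step_up)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(results)


if __name__ == "__main__":
    print("old ordering completes:", run_pair(vote_buggy))  # False
    print("new ordering completes:", run_pair(vote_fixed))  # True
```

In the old ordering each thread holds the lock the other one needs, so at least one acquire always times out. Releasing the mutex before requesting the RSTL removes the cycle without touching the step-up path.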

Note:
The ticket also addresses the following additional issues:
1) Currently, the index build (an internal system thread) holds the RSTL with the uninterruptible guard enabled, so it blocks replication state transitions such as step-up and step-down. (SERVER-44045)

2) We acquire the collection lock in a stronger mode (mode X) in order to commit or abort. As a result, this can lead to deadlocks involving prepared transactions, stepdown, and the IndexBuildsCoordinator. (SERVER-44722)

3) Currently, IndexBuildsCoordinatorMongod::_waitForNextIndexBuildAction() holds the RSTL only for the scope of the while loop. As a result, the primary check performed there may no longer be valid once the lock is released. (SERVER-46989)

4) Also, the index build retries its vote on error without checking for interrupts such as shutdown. This can make shutdown hang forever, because shutdown waits for index builds to complete.
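Item 4 describes a retry loop that never checks for interruption. A minimal sketch of an interrupt-aware alternative, assuming shutdown is signalled through a `threading.Event`; `vote_with_retries`, `send_vote` and `ShutdownInterrupted` are hypothetical names, not server APIs.

```python
import threading


class ShutdownInterrupted(Exception):
    """Hypothetical stand-in for the interruption error raised at shutdown."""


def vote_with_retries(send_vote, shutdown, backoff=0.05):
    """Retry `send_vote` until it returns True, unless `shutdown` gets set.

    send_vote: hypothetical callable; returns True on success, False on error.
    shutdown:  threading.Event that the shutdown path sets.
    Returns the number of attempts made on success.
    """
    attempts = 0
    while True:
        # Check for interruption before every attempt, not just the first.
        if shutdown.is_set():
            raise ShutdownInterrupted(f"gave up after {attempts} attempt(s)")
        attempts += 1
        if send_vote():
            return attempts
        # Sleep interruptibly: Event.wait returns as soon as shutdown is set.
        shutdown.wait(backoff)


if __name__ == "__main__":
    flaky = iter([False, False, True])  # fails twice, then succeeds
    print(vote_with_retries(lambda: next(flaky), threading.Event()))  # 3

    stop = threading.Event()
    threading.Timer(0.2, stop.set).start()  # "shutdown" after 200 ms
    try:
        vote_with_retries(lambda: False, stop)  # would spin forever otherwise
    except ShutdownInterrupted as exc:
        print("interrupted:", exc)
```

Waiting on the event instead of a plain sleep means a shutdown request ends the backoff immediately rather than after the next retry interval.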

UPDATE: This ticket will not address the additional issues above; they will be addressed separately.
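The concern in item 2 comes down to lock-mode compatibility. Mode X conflicts with every other mode, and a prepared transaction keeps its locks until it commits or aborts, so a commit or abort that needs the collection in X must wait behind it. A minimal sketch of the textbook IS/IX/S/X compatibility matrix, assuming the prepared transaction holds the collection in IX (as a write would); `blockers` is a hypothetical helper, and the sketch does not say which mode the eventual fix chose.

```python
# Textbook compatibility matrix for the four hierarchical lock modes.
COMPATIBLE = {
    "IS": {"IS", "IX", "S"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "X": set(),
}


def blockers(requested, held):
    """Return the modes in `held` that a request for `requested` must wait on."""
    return [mode for mode in held if mode not in COMPATIBLE[requested]]


if __name__ == "__main__":
    prepared_txn_holds = ["IX"]  # assumption: a prepared write transaction
    print(blockers("X", prepared_txn_holds))   # ['IX']: the commit must wait
    print(blockers("IX", prepared_txn_holds))  # []: intent modes coexist
```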



 Comments   
Comment by Githook User [ 19/Mar/20 ]

Author: Suganthi Mani (smani87) <suganthi.mani@mongodb.com>

Message: SERVER-46704 IndexBuildsCoordinatorMongod::voteCommitIndexBuild should not persist votes with ReplIndexBuildState::mutex lock held.

(cherry picked from commit 1d7f1c5d9e482b7508a1879eda9835cd1ea2c185)
Branch: v4.4
https://github.com/mongodb/mongo/commit/bc5c557e3297133d18c26870a781b7bfa92de399

Comment by Githook User [ 19/Mar/20 ]

Author: Suganthi Mani (smani87) <suganthi.mani@mongodb.com>

Message: SERVER-46704 IndexBuildsCoordinatorMongod::voteCommitIndexBuild should not persist votes with ReplIndexBuildState::mutex lock held.
Branch: master
https://github.com/mongodb/mongo/commit/1d7f1c5d9e482b7508a1879eda9835cd1ea2c185

Generated at Thu Feb 08 05:12:13 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.