|
This problem surfaced in a build failure in which a test timed out on the primary node. The sequence of events was:
- [T1] We have an in-progress two-phase index build.
- [T2] User runs the command to abort the in-progress index build.
- [T1] Index build is ready to vote for committing.
- [T1] Runs the vote command locally via the DBDirectClient (referred to as [T3] below).
- [T2] Performs the abort and waits until the index builder thread [T1] receives the signal, while continuing to hold the exclusive collection lock.
- [T4] An applyOps command is run that requires the exclusive global lock. Its request is enqueued because [T2] is holding the intent global lock.
- [T3] Tries to get the index build entry in the config.system.indexBuilds collection. This requires the intent global lock, and its request is enqueued behind [T4]'s global lock request.
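
The queueing behaviour in the last two steps can be illustrated with a toy model. This is only a sketch under stated assumptions: `FifoLock`, `COMPATIBLE`, and `request` are hypothetical names invented for illustration, not MongoDB's lock manager, and the model covers only the IX and X modes on a single global resource. It shows why [T3]'s IX request waits even though IX is compatible with the IX lock [T2] already holds: a conflicting X request is queued ahead of it.

```python
from collections import deque

# Compatibility of the two global lock modes involved here:
# intent-exclusive (IX) is compatible with IX, exclusive (X) with nothing.
COMPATIBLE = {
    ("IX", "IX"): True,
    ("IX", "X"): False,
    ("X", "IX"): False,
    ("X", "X"): False,
}


class FifoLock:
    """Toy lock with a FIFO wait queue (hypothetical, for illustration only)."""

    def __init__(self):
        self.granted = []       # list of (holder, mode) currently holding the lock
        self.pending = deque()  # FIFO queue of (waiter, mode) still waiting

    def request(self, who, mode):
        # Grant only if nothing is already queued (FIFO fairness) and the
        # requested mode is compatible with every currently granted mode.
        compatible = all(COMPATIBLE[(held, mode)] for _, held in self.granted)
        if not self.pending and compatible:
            self.granted.append((who, mode))
            return "granted"
        self.pending.append((who, mode))
        return "queued"


global_lock = FifoLock()
results = [
    global_lock.request("T2", "IX"),  # abort thread takes Global IX
    global_lock.request("T4", "X"),   # applyOps conflicts with T2's IX
    global_lock.request("T3", "IX"),  # compatible with T2, but behind T4
]
print(results)  # ['granted', 'queued', 'queued']
```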
This results in the following deadlock:
- [T2] holds the Global IX, Database IX, Collection X locks and is waiting for [T1] to finish so that it can complete aborting the index build.
- [T1] is waiting for [T3] to finish voting to commit the index build.
- [T4] has a Global X lock request enqueued and is waiting for it (~1.989 hours when the test timed out). [T2] prevents this acquisition from going through because it continues to hold its locks.
- [T3] has a Global IX lock request enqueued behind [T4]'s request and is waiting for it (~1.989 hours when the test timed out).
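
The four waits above form a cycle, which can be checked mechanically. The following is a minimal sketch, assuming each thread waits on exactly one other thread as described in the list; `WAITS_FOR` and `find_cycle` are hypothetical names used for illustration and do not correspond to any server code.

```python
# Wait-for edges taken from the list above: "A": "B" means A is waiting on B.
WAITS_FOR = {
    "T2": "T1",  # abort waits for the index builder thread to finish
    "T1": "T3",  # index builder waits for the vote to complete
    "T3": "T4",  # vote's Global IX request is queued behind T4's Global X
    "T4": "T2",  # applyOps' Global X request is blocked by T2's Global IX
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from `start`; return the cycle as a list, or None."""
    path = []
    seen = set()
    node = start
    while node in waits_for and node not in seen:
        seen.add(node)
        path.append(node)
        node = waits_for[node]
    if node in seen:
        # Close the loop by repeating the node where the cycle begins.
        return path[path.index(node):] + [node]
    return None


cycle = find_cycle(WAITS_FOR, "T2")
print(" -> ".join(cycle))  # T2 -> T1 -> T3 -> T4 -> T2
```

Removing any single edge, for example by having [T2] release its locks while waiting, leaves no cycle in this model.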
|