[SERVER-48123] voteCommitIndexBuild() can hang waiting for its lock acquisition to be granted when there is a stronger lock request waiting ahead of it and when the index build is being aborted Created: 12/May/20  Updated: 27/Oct/23  Resolved: 18/May/20

Status: Closed
Project: Core Server
Component/s: Index Maintenance
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Gregory Wlodarek
Assignee: Gregory Wlodarek
Resolution: Gone away
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-48235 The primary node should use the Async... Closed
Operating System: ALL
Sprint: Execution Team 2020-06-01
Participants:
Linked BF Score: 22

 Description   

This problem came up in a build failure that resulted in a test timeout on the primary node.

  1. [T1] We have an in-progress two-phase index build.
  2. [T2] User runs the command to abort the in-progress index build.
  3. [T1] Index build is ready to vote for committing.
  4. [T1] Runs the vote command locally via the DBDirectClient [T3].
  5. [T2] Performs the abort and waits until the index builder thread [T1] receives the signal. Continues holding the exclusive collection lock.
  6. [T4] An applyOps command is run that requires the exclusive global lock. Its request is enqueued because [T2] is holding the intent global lock.
  7. [T3] Tries to get the index build entry in the config.system.indexBuilds collection. This requires the intent global lock, and the request is enqueued behind [T4]'s global lock request.
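
Steps 5 to 7 can be reproduced with a toy lock queue. This is only a sketch: the `ToyLock` class, the two-mode compatibility table, and the strict FIFO grant policy are assumptions made for illustration and are not the server's lock manager.

```python
from collections import deque

# Toy compatibility table (assumption): intent-exclusive (IX) requests can
# coexist with each other, exclusive (X) conflicts with everything.
COMPATIBLE = {
    ("IX", "IX"): True,
    ("IX", "X"): False,
    ("X", "IX"): False,
    ("X", "X"): False,
}


class ToyLock:
    """Hypothetical single-resource lock with a strict FIFO wait queue."""

    def __init__(self):
        self.granted = []       # (owner, mode) pairs currently holding the lock
        self.waiting = deque()  # (owner, mode) pairs queued in arrival order

    def request(self, owner, mode):
        compatible = all(COMPATIBLE[(mode, held)] for _, held in self.granted)
        # Grant only if nothing is already waiting, so a queued stronger
        # request is never overtaken by a later, weaker one.
        if compatible and not self.waiting:
            self.granted.append((owner, mode))
            return "granted"
        self.waiting.append((owner, mode))
        return "waiting"


global_lock = ToyLock()
results = {
    "T2": global_lock.request("T2", "IX"),  # abort thread holds Global IX
    "T4": global_lock.request("T4", "X"),   # applyOps wants Global X
    "T3": global_lock.request("T3", "IX"),  # vote wants Global IX
}
print(results)
```

In this model [T3]'s IX request is compatible with the IX lock that [T2] holds, yet it still waits, because [T4]'s X request is ahead of it in the queue.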

Given this, we have the following deadlock:

  • [T2] holds the Global IX, Database IX, Collection X locks and is waiting for [T1] to finish so that it can complete aborting the index build.
  • [T1] is waiting for [T3] to finish voting to commit the index build.
  • [T4] has a lock request enqueued (Global X) and is waiting for it; the request had been pending for ~1.989 hours when the test timed out. [T2] is preventing this acquisition from going through as it continues to hold its locks.
  • [T3] has a lock request enqueued (Global IX) behind [T4]'s request and is waiting for it; the request had been pending for ~1.989 hours when the test timed out.
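
The four waits listed above form a cycle, which can be checked with a small wait-for graph. This is a minimal sketch: the `WAITS_FOR` mapping is transcribed by hand from the bullets and `find_cycle` is a hypothetical helper, not server code.

```python
# Each thread maps to the thread it is blocked on, per the bullets above.
WAITS_FOR = {
    "T2": "T1",  # abort waits for the index builder to observe the signal
    "T1": "T3",  # index builder waits for the vote command to finish
    "T3": "T4",  # vote's Global IX request is queued behind T4's Global X
    "T4": "T2",  # applyOps' Global X request is blocked by T2's Global IX
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from start; return the cycle if one is reached."""
    path = []
    seen = set()
    node = start
    while node in waits_for and node not in seen:
        seen.add(node)
        path.append(node)
        node = waits_for[node]
    if node in seen:
        # Slice from the first visit of the repeated node to close the loop.
        return path[path.index(node):] + [node]
    return None


cycle = find_cycle(WAITS_FOR, "T2")
print(" -> ".join(cycle))  # T2 -> T1 -> T3 -> T4 -> T2
```

Removing any one edge breaks the cycle. For example, if [T3]'s request can be cancelled, [T1] is no longer blocked and `find_cycle` returns `None`.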

 



 Comments   
Comment by Gregory Wlodarek [ 18/May/20 ]

The change in SERVER-48235 to remove the DBDirectClient when running the 'voteCommitIndexBuild' command inadvertently fixes this issue. This is because the abort logic will cancel the AsyncDBClient request. Prior to SERVER-48235, we did not cancel the requests started via the DBDirectClient. I will mark this as fixed.

Comment by Eric Milkie [ 18/May/20 ]

This may have been fixed now.
