[SERVER-46886] BackgroundSync (rollback) thread can hang on aborting the index build. Created: 16/Mar/20  Updated: 27/Oct/23  Resolved: 07/Apr/20

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Suganthi Mani Assignee: Eric Milkie
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Operating System: ALL
Backport Requested:
v4.4
Sprint: Execution Team 2020-04-06, Execution Team 2020-04-20
Participants:
Linked BF Score: 18

 Description   

Assume a 3-node replica set in which the index build coordinator's thread pool size is 1. Node A, the primary, receives two createIndexes commands on the foo.bar collection: one to build index 'x' and the other to build index 'y'. Because the thread pool size is 1, only index build 'x' can start on the 'indexBuildsCoordinatorMongod' thread pool. Index build 'y' has to wait until index build 'x' completes and frees the thread.

Also assume the commit quorum value is 'all', so index build 'x' can't commit until it receives votes from the two secondaries (node B and node C). Suppose the primary steps down before the 'startIndexBuild' oplog entry for index build 'x' is replicated to the secondaries, and node C becomes the new primary. Node C knows nothing about index builds 'x' and 'y' that were started on node A (the old primary), so node A has to roll back. Before rolling back, the BackgroundSync thread aborts any active index builds.

Now we can get into a deadlock on node A if:
1) The BackgroundSync thread signals one index build, say 'y', to abort and waits for the index builder (indexBuildsCoordinatorMongod-X) thread for 'y' to join.
2) The index builder for 'y' is stuck waiting for index build 'x' to complete and free the indexBuildsCoordinatorMongod-X thread.
3) The index builder for 'x' is waiting for the BackgroundSync thread to signal it to abort its index build.

The net effect is that node A gets stuck in the rollback process and cannot transition to the secondary replication state.
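
The circular wait described above can be reproduced in miniature. The following is only an illustrative Python sketch, not server code: a single-worker `ThreadPoolExecutor` stands in for the size-1 indexBuildsCoordinatorMongod pool, a `threading.Event` stands in for the abort signal, and the names `index_build` and `abort_one_at_a_time` are hypothetical. A timeout replaces the real indefinite hang so the demo can finish.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as JoinTimeout


def index_build(abort_signal: threading.Event) -> str:
    # Stand-in for a build that cannot commit (quorum never reached):
    # it only returns once it has been told to abort.
    abort_signal.wait()
    return "aborted"


def abort_one_at_a_time(order, timeout=0.5):
    """Abort builds serially, joining each before signalling the next.

    Returns the name of the build whose join timed out, or None.
    """
    pool = ThreadPoolExecutor(max_workers=1)  # pool size 1, as in the ticket
    signals = {name: threading.Event() for name in ("x", "y")}
    # 'x' is submitted first and takes the only worker; 'y' stays queued.
    futures = {name: pool.submit(index_build, signals[name]) for name in ("x", "y")}

    stuck = None
    for name in order:
        signals[name].set()  # signal this build to abort
        try:
            futures[name].result(timeout=timeout)  # wait for it to join
        except JoinTimeout:
            stuck = name  # in the real scenario this wait never returns
            break

    # Release everything so the demo itself can shut down cleanly.
    for signal in signals.values():
        signal.set()
    pool.shutdown(wait=True)
    return stuck
```

Calling `abort_one_at_a_time(["y", "x"])` returns `"y"`: the join on 'y' cannot finish because 'x' still holds the only worker and has not been signalled yet. Calling it with `["x", "y"]` returns `None`, which shows that the hang depends on the order in which the builds are aborted.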



 Comments   
Comment by Eric Milkie [ 07/Apr/20 ]

We made the thread pool unlimited in SERVER-47155, so that resolved this issue.

Comment by Eric Milkie [ 20/Mar/20 ]

I agree, that's a reasonable solution for this one.

Comment by Suganthi Mani [ 16/Mar/20 ]

For each active index build, we currently abort the index build and wait for it to join. This can lead to the deadlock scenario above. So the solution should be something like this: abort all active index builds first, and then wait for all the index builder threads to join.
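
A rough Python sketch of that two-phase approach, under the same simplifying assumptions as the scenario in the description (a single-worker pool and an event-based abort signal), might look like the following. The names `index_build` and `abort_all_then_join` are hypothetical and do not correspond to server functions.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def index_build(abort_signal: threading.Event) -> str:
    # Stand-in for a build that only returns once it has been aborted.
    abort_signal.wait()
    return "aborted"


def abort_all_then_join(names, timeout=5.0):
    """Signal every build to abort first, then join them all."""
    pool = ThreadPoolExecutor(max_workers=1)  # pool size 1, as in the ticket
    signals = {name: threading.Event() for name in names}
    # Builds are queued in order; only the first one holds the worker.
    futures = {name: pool.submit(index_build, signals[name]) for name in names}

    # Phase 1: signal all builds, in the worst-case (reverse) order.
    for name in reversed(names):
        signals[name].set()

    # Phase 2: join. No build is left waiting for a signal, so each one
    # finishes as soon as the worker thread reaches it.
    results = {name: futures[name].result(timeout=timeout) for name in reversed(names)}
    pool.shutdown(wait=True)
    return results
```

Because every build has already been signalled before the first join, the order of the joins no longer matters: `abort_all_then_join(["x", "y"])` completes with both builds reporting `"aborted"`.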

Generated at Thu Feb 08 05:12:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.