[SERVER-46886] BackgroundSync (rollback) thread can hang on aborting the index build. Created: 16/Mar/20 Updated: 27/Oct/23 Resolved: 07/Apr/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Suganthi Mani | Assignee: | Eric Milkie |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Operating System: | ALL | ||||||||
| Backport Requested: | v4.4 | ||||||||
| Sprint: | Execution Team 2020-04-06, Execution Team 2020-04-20 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 18 | ||||||||
| Description |
|
Assume a 3-node replica set in which the index build coordinator thread pool has size 1. Node A, the primary, receives two createIndexes commands, one to build index 'x' and the other to build index 'y' on the foo.bar collection. Because the thread pool size is 1, only index build 'x' can start on the 'indexBuildsCoordinatorMongod' thread pool; index build 'y' has to wait until index build 'x' completes and frees the thread. Also assume the commit quorum value is 'all', so index build 'x' can't commit until it receives votes from the other 2 secondaries (node B and node C). Now assume that, before the 'startIndexBuild' oplog entry for index build 'x' is replicated to the secondaries, the primary steps down and node C becomes the new primary. Node C doesn't know anything about index builds 'x' and 'y' that were started on node A (the old primary), so node A has to roll back. Before doing the rollback, the BackgroundSync thread aborts any active index builds. At this point we can get into a deadlock scenario on node A. The net effect is that node A gets stuck in the rollback process and can't transition to the secondary replication state. |
| Comments |
| Comment by Eric Milkie [ 07/Apr/20 ] |
|
We made the thread pool unlimited in |
| Comment by Eric Milkie [ 20/Mar/20 ] |
|
I agree, that's a reasonable solution for this one. |
| Comment by Suganthi Mani [ 16/Mar/20 ] |
|
For each active index build, we currently abort the index build and wait for it to join. This can lead to the deadlock scenario described above. So the solution should be something like this: we should first abort all active index builds, and only then wait for all the index builder threads to join. |
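
The difference between the two abort strategies discussed in this comment can be illustrated with a small Python sketch. This is not the server's code: `IndexBuild`, `start_builds`, `abort_each_and_join`, and `abort_all_then_join` are hypothetical names, the pool is a plain `ThreadPoolExecutor`, and a build is modelled as a task that only finishes once it is told to abort (standing in for a build stuck waiting on commit quorum votes). The sketch also assumes the hang arises when a queued build is aborted and joined before the build that holds the only pool thread; the ticket does not spell out the exact conditions. A timeout is used so the hang is observable rather than permanent.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


class IndexBuild:
    """Hypothetical stand-in for an index build; not MongoDB's actual class."""

    def __init__(self, name):
        self.name = name
        self.abort_requested = threading.Event()

    def run(self):
        # Models a build that cannot finish on its own (for example, it is
        # waiting for commit quorum votes); it returns only once aborted.
        self.abort_requested.wait()
        return f"{self.name} aborted"


def start_builds(pool, names):
    """Submit one task per build; with a pool of size 1, later builds queue."""
    builds = [IndexBuild(name) for name in names]
    futures = [pool.submit(build.run) for build in builds]
    return builds, futures


def abort_each_and_join(builds, futures, timeout):
    """Sequential pattern: abort one build, wait for it to join, then move on.

    Returns False if some build did not join within the timeout, which in the
    untimed version would be a hang.
    """
    for build, future in zip(builds, futures):
        build.abort_requested.set()
        _, not_done = wait([future], timeout=timeout)
        if not_done:
            return False
    return True


def abort_all_then_join(builds, futures, timeout):
    """Two-phase pattern: signal every build first, then wait for all of them."""
    for build in builds:
        build.abort_requested.set()
    _, not_done = wait(futures, timeout=timeout)
    return not not_done
```

With a pool of size 1 and builds 'x' and 'y' submitted in that order, calling `abort_each_and_join` with 'y' first times out, because 'y' cannot run until 'x' releases the thread and 'x' has not been signalled yet. `abort_all_then_join` signals both builds before waiting, so 'x' exits, 'y' gets the thread and exits immediately, and both join.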