Details
-
Bug
-
Resolution: Gone away
-
Major - P3
-
None
-
None
-
None
-
ALL
-
v4.4
-
Execution Team 2020-04-06, Execution Team 2020-04-20
-
18
Description
Assume the index build coordinator thread pool has size 1 and the replica set has 3 nodes. Node A, the primary, receives 2 createIndexes commands on the foo.bar collection, one to build index 'x' and the other to build index 'y'. Because the pool size is 1, only index build 'x' can start on the 'indexBuildsCoordinatorMongod' thread pool. Index build 'y' has to wait until index build 'x' completes and frees the thread.
Also assume the commit quorum value is 'all', so index build 'x' can't commit until it receives votes from the other 2 secondaries (node B and node C). Assume that, before the 'startIndexBuild' oplog entry for index build 'x' is replicated to the secondaries, the primary steps down and node C becomes the new primary. Node C knows nothing about index builds 'x' and 'y', which were started on node A (the old primary). As a result, the old primary (node A) has to roll back. Before rolling back, the BackgroundSync thread aborts any active index builds.
Now node A can deadlock if:
1) The BackgroundSync thread signals an index build, say 'y', to abort and waits for the index builder thread (indexBuildsCoordinatorMongod-X) for 'y' to join.
2) The index builder for 'y' is stuck waiting for index build 'x' to complete and free the indexBuildsCoordinatorMongod-X thread.
3) The index builder for 'x' is waiting for the BackgroundSync thread to signal it to abort its index build.
The net effect is that node A gets stuck in the rollback process and cannot transition to the secondary replication state.
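The circular wait in steps 1) to 3) can be reproduced with a small toy model. This is only a sketch in Python, not server code: `simulate`, `build`, `abort_signals`, and `join_timeout` are hypothetical names, a `ThreadPoolExecutor` with one worker stands in for the coordinator thread pool, and a `threading.Event` per build stands in for the abort signal. The join timeout exists only so the demo terminates; in the scenario above the wait is unbounded.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as JoinTimeout


def simulate(abort_order, join_timeout=0.5):
    """Abort builds in the given order; return the first build whose join
    timed out (the deadlock point), or None if every join succeeded."""
    # Stand-in for the coordinator thread pool of size 1.
    pool = ThreadPoolExecutor(max_workers=1)
    abort_signals = {"x": threading.Event(), "y": threading.Event()}

    def build(name):
        # Stand-in for an index build that cannot commit (quorum 'all' never
        # satisfied): it holds the pool thread until it is told to abort.
        abort_signals[name].wait()
        return name

    # 'x' takes the only thread; 'y' is queued behind it.
    futures = {name: pool.submit(build, name) for name in ("x", "y")}

    stuck = None
    for name in abort_order:
        abort_signals[name].set()  # signal this build to abort
        try:
            # Wait for the builder to finish, like the BackgroundSync join.
            futures[name].result(timeout=join_timeout)
        except JoinTimeout:
            stuck = name  # the join would never return without the timeout
            break

    # Clean up so the demo exits: release every build and drain the pool.
    for signal in abort_signals.values():
        signal.set()
    pool.shutdown(wait=True)
    return stuck


if __name__ == "__main__":
    print("abort y first:", simulate(("y", "x")))
    print("abort x first:", simulate(("x", "y")))
```

In this toy model, signalling 'y' first gets stuck joining 'y', because 'y' never gets a thread while 'x' is still waiting for its own abort signal. Signalling 'x' first lets both joins complete. That only illustrates the ordering dependency; it is not a claim about how the server resolves the issue.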