In renameCollection, we obtain the DB lock in X mode and then assert that there are no in-progress index builds. However, according to the logs in the linked BF, an index build was in progress while renameCollection was running. In the index build code path, it seems that we don't always hold on to collection- and DB-level locks. Specifically, while we are waiting for the next action, we assert that we are not holding any locks. As a result, it seems possible for the following sequence to occur:
- A node starts initial sync
- Index build starts
- Index build releases its collection- and DB-level locks while waiting for the next action, before committing
- renameCollection command begins and obtains DB lock
- renameCollection tries to assert that there are no index builds in progress, finds that there are, and aborts the in-progress index builds
After the index build aborts, we should probably fail initial sync. However, in the BF, we see that initial sync still managed to complete successfully. The test only failed when checkReplicatedDataHashes discovered a DB hash mismatch between the primary and initial sync node.
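The interleaving listed above can be sketched as a small, single-threaded toy model. This is only an illustration under stated assumptions, not server code: `ToyDatabase`, `BuildState`, `simulate_interleaving`, and the index name `idx_a` are hypothetical, and the real lock hierarchy (intent locks, collection locks) is collapsed into one exclusive lock.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class BuildState(Enum):
    IN_PROGRESS = "in progress"
    ABORTED = "aborted"


@dataclass
class ToyDatabase:
    """Hypothetical stand-in: one exclusive DB lock plus a registry of index builds."""
    lock_holder: Optional[str] = None
    builds: Dict[str, BuildState] = field(default_factory=dict)

    def acquire_x(self, who: str) -> None:
        # Fails instead of blocking so the model stays deterministic.
        if self.lock_holder is not None:
            raise RuntimeError(f"{who} blocked: lock held by {self.lock_holder}")
        self.lock_holder = who

    def release(self, who: str) -> None:
        if self.lock_holder != who:
            raise RuntimeError(f"{who} does not hold the lock")
        self.lock_holder = None


def simulate_interleaving() -> ToyDatabase:
    db = ToyDatabase()

    # Index build starts (during initial sync) and registers itself.
    db.acquire_x("indexBuild")
    db.builds["idx_a"] = BuildState.IN_PROGRESS

    # Index build drops its locks while waiting for the next action,
    # even though it has not committed yet.
    db.release("indexBuild")

    # renameCollection can now take the DB lock in X mode.
    db.acquire_x("renameCollection")

    # Holding the lock does not imply that no build is in progress:
    # the build is still registered, so it gets aborted here.
    for name, state in list(db.builds.items()):
        if state is BuildState.IN_PROGRESS:
            db.builds[name] = BuildState.ABORTED

    db.release("renameCollection")
    return db
```

The point of the sketch is that acquiring the DB lock succeeds even though the registry still contains an unfinished build, so a check made under the lock can still observe in-progress index builds.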
EDIT: We found that the root bug is that the node aborts all index builds when it is in initial sync. Only two-phase index builds should be safe to abort. For single-phase index builds, the node should wait for them to complete instead of aborting them.
Single-phase index builds aren't safe to abort because the primary does not wait on secondaries before committing or aborting them. This could lead to a situation where the primary commits an index build that the initial sync node aborts. For two-phase index builds, I believe the primary is notified if the index build is aborted, so it can abort its own index build and notify the user, thus avoiding any data inconsistency issues.
Modified the ticket name to reflect this change.
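The abort policy proposed in the EDIT above can be written as a small decision function. This is a minimal sketch of the rule as described in this ticket, not the server's implementation; `Protocol`, `Action`, `action_for_conflicting_build`, and `plan_for_builds` are hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import Dict


class Protocol(Enum):
    SINGLE_PHASE = "single-phase"
    TWO_PHASE = "two-phase"


class Action(Enum):
    ABORT = "abort"
    WAIT_FOR_COMPLETION = "wait for completion"


def action_for_conflicting_build(protocol: Protocol) -> Action:
    """Decide what an initial sync node does with a conflicting index build."""
    if protocol is Protocol.TWO_PHASE:
        # Per the ticket, the primary is believed to be notified of the abort,
        # so it can abort its own build and stay consistent.
        return Action.ABORT
    # Single-phase: the primary does not wait on secondaries, so aborting
    # locally could diverge from a build the primary commits.
    return Action.WAIT_FOR_COMPLETION


def plan_for_builds(builds: Dict[str, Protocol]) -> Dict[str, Action]:
    """Map each in-progress build (by hypothetical name) to its action."""
    return {name: action_for_conflicting_build(p) for name, p in builds.items()}
```

For example, `plan_for_builds({"idx_a": Protocol.SINGLE_PHASE, "idx_b": Protocol.TWO_PHASE})` would wait on `idx_a` and abort `idx_b`.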