[SERVER-55008] Only abort two-phase index builds when BackgroundOperationInProg error in initial sync Created: 05/Mar/21 Updated: 29/Oct/23 Resolved: 07/Apr/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.4.6 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Xuerui Fa | Assignee: | Huayu Ouyang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.4 |
| Sprint: | Repl 2021-04-05, Repl 2021-04-19 |
| Participants: | |
| Linked BF Score: | 128 |
| Description |
|
In renameCollection, we acquire the DB lock in X mode and then assert that there are no in-progress index builds. However, according to the logs in the linked BF, an index build was in progress while renameCollection was running. The index build code path does not always hold on to collection-level and DB-level locks. Specifically, while waiting for the next action, we assert that we are not holding any locks. As a result, it seems possible for the following sequence to occur:
After the index build aborts, we should probably fail initial sync. However, in the BF, initial sync still completed successfully. The test only failed when checkReplicatedDataHashes discovered a DB hash mismatch between the primary and the initial sync node.
EDIT: We found that the root bug is that the node aborts all index builds while it is in initial sync. Only two-phase index builds are safe to abort; the node should wait for single-phase index builds to complete instead of aborting them. Single-phase index builds aren't safe to abort because the primary does not wait on secondaries before committing or aborting them, so the primary could commit an index build that the initial sync node aborts. For two-phase index builds, I believe the primary is notified when the index build is aborted, so it can abort its own index build and notify the user, which avoids data inconsistency. Modified the ticket title to reflect this change. |
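The intended rule from the EDIT above can be sketched as a small decision function. This is an illustrative Python model only, not the server's C++ implementation; the names `Protocol`, `Action`, `IndexBuild`, and `plan_initial_sync_actions` are hypothetical and do not correspond to real server identifiers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable


class Protocol(Enum):
    SINGLE_PHASE = "single-phase"
    TWO_PHASE = "two-phase"


class Action(Enum):
    ABORT = "abort"
    WAIT = "wait for completion"


@dataclass(frozen=True)
class IndexBuild:
    # Hypothetical stand-in for an in-progress index build on the syncing node.
    name: str
    protocol: Protocol


def action_for(build: IndexBuild) -> Action:
    """Decide what an initial sync node does with a conflicting index build."""
    if build.protocol is Protocol.TWO_PHASE:
        # Per the ticket, two-phase builds are considered safe to abort.
        return Action.ABORT
    # The primary does not wait on secondaries for single-phase builds,
    # so aborting here could diverge from what the primary commits.
    return Action.WAIT


def plan_initial_sync_actions(builds: Iterable[IndexBuild]) -> Dict[str, Action]:
    """Map each in-progress build name to the action chosen for it."""
    return {build.name: action_for(build) for build in builds}


if __name__ == "__main__":
    in_progress = [
        IndexBuild("a_1", Protocol.TWO_PHASE),
        IndexBuild("b_1", Protocol.SINGLE_PHASE),
    ]
    for name, action in plan_initial_sync_actions(in_progress).items():
        print(f"{name}: {action.value}")
```

The buggy behavior described in this ticket corresponds to returning `Action.ABORT` for every build regardless of protocol.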
| Comments |
| Comment by Githook User [ 08/Apr/21 ] |
|
Author: {'name': 'Huayu Ouyang', 'email': 'huayu.ouyang@mongodb.com', 'username': 'huayu-ouyang'} Message: |
| Comment by Githook User [ 07/Apr/21 ] |
|
Author: {'name': 'Huayu Ouyang', 'email': 'huayu.ouyang@mongodb.com', 'username': 'huayu-ouyang'} Message: |
| Comment by Xuerui Fa [ 31/Mar/21 ] |
|
Declining the 4.4 backport request: the bug only occurs on 4.4, so it will be fixed there directly. This can't happen on master because replica sets on master only use two-phase index builds, so initial sync never has a single-phase index build to abort. |