Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.4.6
Affects Version/s: None
Component/s: Replication
Labels: None

Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.4
Sprint: Repl 2021-04-05, Repl 2021-04-19
Linked BF Score: 128
Confidence Status: None
Work Order: 3
CAR Domain/s: None

Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

In renameCollection, we obtain the DB lock in X mode. Then, we assert that there are no in-progress index builds. However, according to the logs in the linked BF, an index build was in progress while we were running renameCollection. In the index build code path, it seems that we don't always hold on to collection- and DB-level locks. Specifically, while we are waiting for a next action, we assert that we are not holding any locks. As a result, it seems possible for the following sequence to occur:

1. A node starts initial sync.
2. An index build starts.
3. The index build releases its collection- and DB-level locks while waiting for a next action, before committing.
4. The renameCollection command begins and obtains the DB lock.
5. renameCollection tries to assert that there are no index builds in progress, finds that there are, and aborts the in-progress index builds.
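
The interleaving above can be sketched with a toy model. This is only an illustration, not server code: `ToyDatabase`, `lock_x`, `unlock`, and `simulate_interleaving` are hypothetical names, and a single exclusive lock stands in for the real collection- and DB-level lock hierarchy.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ToyDatabase:
    """Hypothetical stand-in for one database guarded by one exclusive lock."""
    lock_holder: Optional[str] = None
    in_progress_builds: Set[str] = field(default_factory=set)

    def lock_x(self, who: str) -> None:
        # Exclusive mode: fail if anyone else currently holds the lock.
        if self.lock_holder is not None:
            raise RuntimeError(f"{who} is blocked by {self.lock_holder}")
        self.lock_holder = who

    def unlock(self, who: str) -> None:
        assert self.lock_holder == who, "only the holder may unlock"
        self.lock_holder = None


def simulate_interleaving(release_while_waiting: bool = True) -> List[str]:
    """Return the names of index builds that the rename ends up aborting."""
    db = ToyDatabase()

    # The index build starts under the lock and registers itself.
    db.lock_x("indexBuild")
    db.in_progress_builds.add("idx_a")

    # While waiting for its next action, the build holds no locks
    # (assumption of this sketch: modelled by a plain unlock).
    if release_while_waiting:
        db.unlock("indexBuild")

    # renameCollection can now take the lock even though the build
    # has neither committed nor aborted.
    db.lock_x("renameCollection")

    # It finds in-progress builds and aborts them.
    aborted = sorted(db.in_progress_builds)
    db.in_progress_builds.clear()
    db.unlock("renameCollection")
    return aborted
```

With `release_while_waiting=True` the rename observes and aborts `idx_a`; with `False` the rename cannot take the lock at all in this model, which is the behaviour the assertion in renameCollection appears to rely on.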

After the index build aborts, we should probably fail initial sync. However, in the BF, we see that initial sync still managed to complete successfully. The test only failed when checkReplicatedDataHashes discovered a DB hash mismatch between the primary and initial sync node.

EDIT: We found that the root bug is that the node aborts all index builds when it is in initial sync. Only two-phase index builds should be safe to abort; the node should wait for single-phase index builds to complete instead of aborting them.

Single-phase index builds aren't safe to abort because the primary does not wait on secondaries before committing or aborting them. Aborting one on the initial sync node could therefore lead to a situation where the primary commits an index build that the initial sync node has aborted. For two-phase index builds, I believe the primary will get notified if the index build is aborted, so it is able to abort its own index build and notify the user, thus avoiding any data inconsistency issues.
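
A minimal sketch of the proposed rule, assuming the only input that matters is the build's protocol. `Protocol`, `Action`, and `plan_for_initial_sync` are hypothetical names for illustration and do not correspond to identifiers in the server code.

```python
from enum import Enum
from typing import Dict


class Protocol(Enum):
    SINGLE_PHASE = "single-phase"
    TWO_PHASE = "two-phase"


class Action(Enum):
    ABORT = "abort"
    WAIT_FOR_COMPLETION = "wait for completion"


def action_for_build(protocol: Protocol) -> Action:
    """Decide what an initial sync node does with a conflicting index build."""
    if protocol is Protocol.TWO_PHASE:
        # The primary is expected to learn about the abort, so it can
        # abort its own build and stay consistent with this node.
        return Action.ABORT
    # The primary does not wait on secondaries for single-phase builds,
    # so aborting locally could diverge from what the primary commits.
    return Action.WAIT_FOR_COMPLETION


def plan_for_initial_sync(builds: Dict[str, Protocol]) -> Dict[str, Action]:
    """Map each in-progress build name to the action chosen for it."""
    return {name: action_for_build(protocol) for name, protocol in builds.items()}
```

For example, `plan_for_initial_sync({"a_1": Protocol.TWO_PHASE, "b_1": Protocol.SINGLE_PHASE})` aborts `a_1` and waits for `b_1`.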

Modified the ticket name to reflect this change.

Related to:

SERVER-58280 initial sync hangs on hiding dropped index when index builds are active (Closed)
SERVER-54997 Potential issue when building an index and running renameCollection (Closed)

Assignee: Huayu Ouyang
Reporter: Xuerui Fa
Participants: Githook User, Huayu Ouyang, Xuerui Fa
Votes: 0
Watchers: 10

Created: Mar 05 2021 09:10:03 PM UTC
Updated: Oct 29 2023 09:56:37 PM UTC
Resolved: Apr 07 2021 02:53:51 PM UTC
Confidence Status Last Update: 25/Mar/21 5:53 PM
