  1. Core Server
  2. SERVER-55008

Only abort two-phase index builds when BackgroundOperationInProg error in initial sync


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4.6
    • Component/s: Replication
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.4
    • Sprint:
      Repl 2021-04-05, Repl 2021-04-19
    • Linked BF Score:
      128

      Description

      In renameCollection, we obtain the DB lock in X mode and then assert that there are no in-progress index builds. However, according to the logs in the linked BF, an index build was in progress while we were running renameCollection. In the index build code path, it seems that we don't always hold on to collection- and DB-level locks. Specifically, while we are waiting for a next action, we assert that we are not holding any locks. As a result, the following sequence seems possible:

      1. A node starts initial sync
      2. Index build starts
      3. Index build releases collection/db level locks while waiting for next action but before committing
      4. renameCollection command begins and obtains DB lock
      5. renameCollection tries to assert that there are no index builds in progress, finds that there are, and aborts the in-progress index builds
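
      The sequence above can be sketched as a small, deterministic simulation. This is only an illustration of the interleaving, not the server's code: the lock model is deliberately simplified, and every name in it (IndexBuild, Database, try_acquire_x_lock, simulate_race) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IndexBuild:
    name: str
    holds_locks: bool = True      # collection/DB locks held by the build
    state: str = "in_progress"    # "in_progress" or "aborted"


@dataclass
class Database:
    x_lock_holder: Optional[str] = None
    builds: List[IndexBuild] = field(default_factory=list)

    def try_acquire_x_lock(self, who: str) -> bool:
        # Simplified lock model (an assumption): the X lock is granted only if
        # nobody holds it and no index build is holding its own locks.
        if self.x_lock_holder is not None:
            return False
        if any(b.holds_locks for b in self.builds):
            return False
        self.x_lock_holder = who
        return True


def simulate_race() -> List[str]:
    events = ["initial sync started"]                      # step 1
    db = Database()

    build = IndexBuild("idx_a")
    db.builds.append(build)
    events.append("index build started")                   # step 2

    # While the build holds its locks, renameCollection cannot proceed.
    if not db.try_acquire_x_lock("renameCollection"):
        events.append("renameCollection blocked")

    # The build waits for its next action and drops its locks before commit.
    build.holds_locks = False
    events.append("index build released locks")            # step 3

    if db.try_acquire_x_lock("renameCollection"):
        events.append("renameCollection acquired DB X lock")  # step 4

    # renameCollection finds in-progress builds and aborts them.
    for b in db.builds:
        if b.state == "in_progress":
            b.state = "aborted"
            events.append(f"aborted {b.name}")             # step 5
    return events
```

      In this toy model, the abort becomes reachable only because the build gives up its locks while it is still in progress, which is the window described in step 3.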

      After the index build aborts, we should probably fail initial sync. However, in the BF, we see that initial sync still managed to complete successfully. The test only failed when checkReplicatedDataHashes discovered a DB hash mismatch between the primary and initial sync node.

      EDIT: We found that the root bug is that the node aborts all index builds when it is in initial sync. It should only be safe to abort two-phase index builds; the node should wait for single-phase index builds to complete instead of aborting them.

      Single-phase index builds aren't safe to abort because the primary does not wait on secondaries before committing or aborting them. As a result, the primary could commit an index build that the initial sync node aborts. For two-phase index builds, I believe the primary will get notified if the index build is aborted, so it is able to abort its own index build and notify the user, thus avoiding any data inconsistency issues.
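
      The intended behavior can be sketched as a small decision function: for each in-progress index build that conflicts with an operation during initial sync, abort it only if it is two-phase, and otherwise wait for it to finish. This is a minimal sketch of the rule described above, not the actual fix; Protocol, InProgressBuild, and plan_conflicting_builds are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Protocol(Enum):
    SINGLE_PHASE = "single-phase"
    TWO_PHASE = "two-phase"


@dataclass(frozen=True)
class InProgressBuild:
    name: str
    protocol: Protocol


def plan_conflicting_builds(builds: List[InProgressBuild]) -> Dict[str, List[str]]:
    """Split conflicting builds into ones to abort and ones to wait for."""
    plan: Dict[str, List[str]] = {"abort": [], "wait": []}
    for build in builds:
        if build.protocol is Protocol.TWO_PHASE:
            # Two-phase: the ticket expects the primary to learn of the abort.
            plan["abort"].append(build.name)
        else:
            # Single-phase: the primary may commit regardless, so aborting
            # locally could leave the nodes with different indexes.
            plan["wait"].append(build.name)
    return plan
```

      For example, given one build of each kind, the two-phase build ends up under "abort" and the single-phase build under "wait".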

      Modified the ticket name to reflect this change.

              People

              Assignee:
              huayu.ouyang Huayu Ouyang
              Reporter:
              xuerui.fa Xuerui Fa