[SERVER-54997] Potential issue when building an index and running renameCollection Created: 05/Mar/21  Updated: 31/Mar/21  Resolved: 31/Mar/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Xuerui Fa Assignee: Louis Williams
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-55008 Only abort two-phase index builds whe... Closed
Sprint: Execution Team 2021-04-05
Participants:
Linked BF Score: 128

 Description   

In renameCollection, we obtain the DB lock in X mode and then assert that there are no in-progress index builds. However, according to the logs in the linked BF, an index build was in progress while renameCollection was running. The index build code path does not seem to hold collection- and DB-level locks at all times. Specifically, while we are waiting for a next action, we assert that we are not holding any locks. As a result, the following sequence seems possible:

  1. A node starts initial sync
  2. Index build starts
  3. Index build releases collection- and DB-level locks while waiting for the next action, before committing
  4. renameCollection command begins and obtains the DB lock
  5. renameCollection checks for in-progress index builds, finds that there are some, and aborts them
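
The interleaving above can be reproduced with a toy model. This is only a sketch in Python, not the server's code: `ToyDatabase`, `simulate`, the index name `idx_a`, and the event strings are all hypothetical, and a single `threading.Lock` stands in for the DB lock. The one property it borrows from the description is that the index build holds no locks while it waits for its next action.

```python
import threading


class ToyDatabase:
    """Hypothetical stand-in for a database: one lock plus a registry of builds."""

    def __init__(self):
        self.db_lock = threading.Lock()   # stands in for the DB lock
        self.in_progress_builds = set()   # names of registered index builds


def simulate():
    db = ToyDatabase()
    events = []
    build_waiting = threading.Event()   # set once the build has dropped its lock
    rename_done = threading.Event()     # set once the rename has finished

    def index_build():
        # Step 2: register the build while holding the lock.
        with db.db_lock:
            db.in_progress_builds.add("idx_a")
            events.append("build registered")
        # Step 3: no locks are held while waiting for the next action.
        build_waiting.set()
        rename_done.wait()
        with db.db_lock:
            if "idx_a" in db.in_progress_builds:
                db.in_progress_builds.discard("idx_a")
                events.append("build committed")
            else:
                events.append("build found itself aborted")

    def rename_collection():
        build_waiting.wait()
        # Step 4: the rename can take the lock because the build released it.
        with db.db_lock:
            # Step 5: a build is still registered, so it gets aborted.
            if db.in_progress_builds:
                events.append("rename saw in-progress build")
                db.in_progress_builds.clear()
                events.append("rename aborted builds")
        rename_done.set()

    threads = [threading.Thread(target=index_build),
               threading.Thread(target=rename_collection)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


if __name__ == "__main__":
    for event in simulate():
        print(event)
```

The two events force the ordering from the list, so the run is deterministic: the rename always observes the registered build and the build always wakes up to find itself aborted.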

It seems the initial sync node managed to complete initial sync successfully, even after the index build was aborted. I filed SERVER-55008 to resolve that issue on the Replication side. Open question for Execution: is the above situation considered a bug in the system? If not, I think we can resolve this by not running the JS test while nodes are going through initial sync.

 



 Comments   
Comment by Dianna Hohensee (Inactive) [ 31/Mar/21 ]

Alright, happy to close without doing any work. Thanks for looking into it more. Reassigning to Louis, since he looked into it.

Comment by Xuerui Fa [ 31/Mar/21 ]

After discussing with louis.williams, we think the root bug is that the node aborts all index builds when it is in initial sync. It should be safe to abort only two-phase index builds; the node should wait for single-phase index builds to complete instead of aborting them.

We will change the code to abort only two-phase index builds in SERVER-55008, and this ticket can likely be closed. Thanks to Louis for talking this through with us!
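
For illustration only, the intended rule could be sketched as below. This is Python rather than the server's language, and `Protocol` and `plan_for_builds` are hypothetical names that are not taken from the codebase or from SERVER-55008. It assumes each in-progress build can be classified as single-phase or two-phase.

```python
from enum import Enum


class Protocol(Enum):
    """Hypothetical tag for how an index build was started."""
    SINGLE_PHASE = "single-phase"
    TWO_PHASE = "two-phase"


def plan_for_builds(builds):
    """Split in-progress builds into (to_abort, to_wait_for).

    `builds` maps an index build name to its Protocol. Two-phase builds
    are aborted; single-phase builds are waited on until they complete.
    """
    to_abort = [name for name, proto in builds.items()
                if proto is Protocol.TWO_PHASE]
    to_wait_for = [name for name, proto in builds.items()
                   if proto is Protocol.SINGLE_PHASE]
    return to_abort, to_wait_for


if __name__ == "__main__":
    in_progress = {"idx_a": Protocol.TWO_PHASE, "idx_b": Protocol.SINGLE_PHASE}
    print(plan_for_builds(in_progress))
```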
