[SERVER-62216] When abortTenantIndexBuilds failed to abort during a tenant migration, we should wait for the createIndex to finish before continuing the MTM Created: 21/Dec/21 Updated: 29/Oct/23 Resolved: 02/Feb/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 5.1.0 |
| Fix Version/s: | 5.3.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Mathis Bessa | Assignee: | Mathis Bessa |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Operating System: | ALL | ||||||||||||
| Sprint: | Server Serverless 2021-12-27, Server Serverless 2022-01-10, Server Serverless 2022-01-24, Server Serverless 2022-02-07 | ||||||||||||
| Participants: | |||||||||||||
| Linked BF Score: | 15 | ||||||||||||
| Description |
|
We found out that when the tenant_migration_donor_service calls to abort the index build, we end up not aborting the index build because of the current state of the createIndex which here was kCommitQuorumSatisfied We should handle when IndexBuildsCoordinator::abortTenantIndexBuilds calls It was decided instead we should wait for the indexBuilds to finish unregistering the index build. There could be a race condition where we would return committed but the index build hasn't unregistered yet and this could be an issue too. The solution to this is to always wait for the unregisterIndexBuild to finish. In addition to this we will also handle that case to not be an error in the tenant_migration.py hook here : |
| Comments |
| Comment by Githook User [ 22/Jan/22 ] |
|
Author: {'name': 'mathisbessamdb', 'email': 'mathis.bessa@mongodb.com', 'username': 'mathisbessamdb'}Message: |
| Comment by Suganthi Mani [ 31/Dec/21 ] |
|
Just hit my brain, technically, for the case where abortIndexBuildByBuildUUID() returns true for TryAbortResult::kAlreadyAborted(), it is guaranteed that the index build has completed aborting process i.e, an abortIndexBuild oplog entry has generated for that index build. Basically, abortIndexBuildByBuildUUID() This means, we can't have 2 concurrent threads one (tenantMigration thread) executing tryAbort() and other(e.g., dropCollection thread) executing _completeAbort(). UPDATE: So, I think, it's safer to wait for the index build to get removed from the index build registry. CC mathis.bessa |
| Comment by Suganthi Mani [ 31/Dec/21 ] |
|
For Future reference: Following is the another scenario where we can hit the same issue as BF-23463, IndexBuildsCoordinator::abortTenantIndexBuilds() calls abortIndexBuildByBuildUUID() to abort the index builds. abortIndexBuildByBuildUUID() returns BF-23463 showcases the case #1 "where the index build is already in the process of committing". But we can hit the same BF issue for case #2 "the index build is already in the process of being aborted". Looking into the code, we can see ReplIndexBuildState::tryAbort() returns TryAbortResult::kAlreadyAborted() if the index build has already received signal "IndexBuildAction::kPrimaryAbort" (Before we drop the indexes/collection/databases, we try to abort the active index builds using signal "kPrimaryAbort" ) but trying to set "IndexBuildAction::kTenantMigrationAbort". However, for TryAbortResult::kAlreadyAborted() IndexBuildsCoordinator::abortTenantIndexBuilds() returns true. And for this case, it's not guaranteed that index build has finished completing aborting process i.e, has generated the "abortIndexBuild" oplog entry. This means there is a possibility that an "abortIndexBuild" can be generated after the tenant migration has started (i.e, after _startApplyingDonorOpTime), leading to the server crash. In order to tackle both the problematic scenarios , our proposal is to wait for the index build to complete (specifically make sure the index build has finished generating the commit/abortIndexBuild oplog entry) irrespective of abortIndexBuildByBuildUUID() return values. There are currently 2 options: 1) Wait for this ReplIndexBuildState::sharedPromise to get fulfilled.
2) Wait for the index build to get removed from the index build registry.
CC mathis.bessa |