[SERVER-60132] DonorAbortMigrationCmd may leak a tenant migration Created: 21/Sep/21  Updated: 10/Oct/23  Resolved: 10/Oct/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matt Broadstone Assignee: Backlog - Service Architecture
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Assigned Teams:
Service Arch
Operating System: ALL
Backport Requested:
v5.1
Sprint: Server Serverless 2021-10-11
Participants:

 Description   

The implementation of DonorAbortMigrationCmd does not record the fact that a migration has been aborted in the event that it is unable to lookup an existing migration when the command is received. There is a chance (albeit quite rare) that the donor receives an abort command before the migration it intends to abort is started, in which case the donor will report an unknown migration and then proceed to run the full migration that was presumed aborted. The solution is to durably record that the migration is in an aborted state, even if there is no existing known migration.

As an addendum to this work, we should consider updating the PrimaryOnlyService documentation to mention that the lookup method should only be used strictly to observe a service but not interact with it in any way. Instead users should prefer getOrCreate to ensure that some instance (and backing state) always exists.



 Comments   
Comment by Suganthi Mani [ 10/Oct/23 ]

This issue only affects tenant migration(TM). Shard merge(SM) and TM shares the donor service/cmds, it's not an issue for SM because the Cloud does not run the 'abortMigration' command for merges. Given that TM will retire in 7.2 and the window for this race condition to occur in production is exceedingly small, and it has been present since the introduction of tenant migration (version 5.0), I am going to close this ticket as "won't fix".

Comment by Mathis Bessa [ 11/Oct/21 ]

It was decided to send this Jira back to the backlog as we identified that the scope of the change needed for this was getting a lot bigger than expected. When trying to set a test case to abort prior to the migration started we want to set the Doc state to be into a AbortMigration state. However it seems that in a lot of places in the code we are expecting as a first state to be kUnintialized and it seems pretty obvious that is the behavior being enforced here.

The following Pull Request disclose more information about the changes that were done while working on this issue:

https://github.com/10gen/mongo/pull/1087

Generated at Thu Feb 08 05:49:02 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.