[SERVER-60132] DonorAbortMigrationCmd may leak a tenant migration Created: 21/Sep/21 Updated: 10/Oct/23 Resolved: 10/Oct/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matt Broadstone | Assignee: | Backlog - Service Architecture |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||
| Assigned Teams: |
Service Arch
|
||||
| Operating System: | ALL | ||||
| Backport Requested: |
v5.1
|
||||
| Sprint: | Server Serverless 2021-10-11 | ||||
| Participants: | |||||
| Description |
|
The implementation of DonorAbortMigrationCmd does not record the fact that a migration has been aborted in the event that it is unable to lookup an existing migration when the command is received. There is a chance (albeit quite rare) that the donor receives an abort command before the migration it intends to abort is started, in which case the donor will report an unknown migration and then proceed to run the full migration that was presumed aborted. The solution is to durably record that the migration is in an aborted state, even if there is no existing known migration. As an addendum to this work, we should consider updating the PrimaryOnlyService documentation to mention that the lookup method should only be used strictly to observe a service but not interact with it in any way. Instead users should prefer getOrCreate to ensure that some instance (and backing state) always exists. |
| Comments |
| Comment by Suganthi Mani [ 10/Oct/23 ] |
|
This issue only affects tenant migration(TM). Shard merge(SM) and TM shares the donor service/cmds, it's not an issue for SM because the Cloud does not run the 'abortMigration' command for merges. Given that TM will retire in 7.2 and the window for this race condition to occur in production is exceedingly small, and it has been present since the introduction of tenant migration (version 5.0), I am going to close this ticket as "won't fix". |
| Comment by Mathis Bessa [ 11/Oct/21 ] |
|
It was decided to send this Jira back to the backlog as we identified that the scope of the change needed for this was getting a lot bigger than expected. When trying to set a test case to abort prior to the migration started we want to set the Doc state to be into a AbortMigration state. However it seems that in a lot of places in the code we are expecting as a first state to be kUnintialized and it seems pretty obvious that is the behavior being enforced here. The following Pull Request disclose more information about the changes that were done while working on this issue: |