[SERVER-53070] Allow back to back tenant migrations for retries Created: 24/Nov/20  Updated: 29/Oct/23  Resolved: 11/Dec/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Task Priority: Major - P3
Reporter: Judah Schvimer Assignee: Vishnu Kaushik
Resolution: Fixed Votes: 0
Labels: pm-1791_milestone-D
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-53220 Not recover the TenantMigrationAccess... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2020-12-14
Participants:
Linked BF Score: 15

 Description   

I think we agree that:
 
If a donorStartMigration encounters a conflicting migration that is not yet marked as garbage collectable, the donorStartMigration should fail.
 
The question is what to do for:
 
If the donorStartMigration encounters a conflicting migration that is marked as garbage collectable.
 
I think the options are:

  • The donorStartMigration should fail (current).
  • The donorStartMigration should immediately garbage collect the old migration and start the new one.
    • If the new migration has a different migrationId but is for the same tenant:
      • A delayed donorStartMigration from the first migration will get ConflictingOperationInProgress, which should be harmless, since Cloud shouldn't care about the response anymore.
      • A delayed donorForgetMigration from the first migration will get NoSuchTenantMigration, which should also be harmless, since Cloud shouldn't care about the response anymore.
    • If the new migration has the same migrationId but is for a different tenant:
      • This is not a legal thing for Cloud to do, so we can say the behavior is undefined.
  • Allow donorForgetMigration to take a "garbageCollectImmediately: true" flag that Cloud should use if they want to retry a migration quickly.
    • This is only best-effort. It's possible for donorForgetMigration to garbage collect the state, and then for a delayed retry of the first donorStartMigration to restart the first migration. When Cloud then tries to start the second migration, it still fails because there's a conflicting active migration.

I think the second option is most practical, since it's the least amount of work for Cloud and has harmless side-effects.
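
A minimal sketch of the second option as proposed above, written as a toy in-memory model in Python rather than server code. The class ToyDonor, the StateDoc fields, and the assumption that a retry with the same migrationId simply returns the existing state are all hypothetical and only serve to illustrate the two "harmless" error cases listed in the option.

```python
from dataclasses import dataclass


class ConflictingOperationInProgress(Exception):
    """Toy stand-in for the error named in the ticket."""


class NoSuchTenantMigration(Exception):
    """Toy stand-in for the error named in the ticket."""


@dataclass
class StateDoc:
    migration_id: str
    tenant_id: str
    garbage_collectable: bool = False


class ToyDonor:
    """Hypothetical model: at most one state doc per tenant."""

    def __init__(self):
        self.docs_by_tenant = {}

    def donor_start_migration(self, migration_id, tenant_id):
        existing = self.docs_by_tenant.get(tenant_id)
        if existing is not None and existing.migration_id == migration_id:
            # Assumption: a retry of the same migration is idempotent.
            return existing
        if existing is not None:
            if not existing.garbage_collectable:
                # Conflicting migration that is not yet garbage collectable.
                raise ConflictingOperationInProgress(existing.migration_id)
            # Option 2: garbage collect the old migration immediately.
            del self.docs_by_tenant[tenant_id]
        doc = StateDoc(migration_id, tenant_id)
        self.docs_by_tenant[tenant_id] = doc
        return doc

    def donor_forget_migration(self, migration_id):
        for doc in self.docs_by_tenant.values():
            if doc.migration_id == migration_id:
                doc.garbage_collectable = True
                return
        raise NoSuchTenantMigration(migration_id)
```

In this model, once a second migration for the same tenant has replaced the first, a delayed donor_start_migration for the first migrationId raises ConflictingOperationInProgress and a delayed donor_forget_migration raises NoSuchTenantMigration, matching the two sub-bullets of the option.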

EDIT:
In the end, we decided to remove the TenantMigrationAccessBlocker (mtab) entry when we mark an aborted migration document as garbage collectable. The second option has a problem: if we immediately garbage collect the old state doc and insert the new one, the donor could fail over in between the two steps. In that case, we could lose the old state doc without inserting the new one. If a delayed donorStartMigration from the first migration then comes in, we could mistakenly start a migration. So the property we are maintaining instead is "aborted garbage collectable documents do not have a TenantMigrationAccessBlocker entry". To maintain this property, we need to:
1. remove the mtab entry when we mark an aborted document as garbage collectable.
2. avoid creating the mtab entry for aborted garbage collectable documents when recovering mtabs from startup/rollback.
3. remove the op observer onDelete code that deletes the mtab entry for an aborted state doc, since this entry will have already been deleted.
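
The three steps above can be sketched with a toy in-memory model in Python. This is not the server implementation: ToyDonorNode, its method names, the state strings, and the behavior for non-aborted documents in on_delete are assumptions made only to show how the property keeps a retry from conflicting while the old state doc stays in place.

```python
from dataclasses import dataclass


class ConflictingOperationInProgress(Exception):
    """Toy stand-in for the error named in the ticket."""


@dataclass
class StateDoc:
    migration_id: str
    tenant_id: str
    state: str = "in progress"  # hypothetical values: "in progress", "aborted"
    garbage_collectable: bool = False


class ToyDonorNode:
    def __init__(self):
        self.state_docs = {}  # migration_id -> StateDoc (durable in this model)
        self.mtabs = {}       # tenant_id -> migration_id (in-memory entries)

    def start_migration(self, migration_id, tenant_id):
        if migration_id in self.state_docs:
            # Assumption: the retained old state doc lets a delayed retry be
            # recognized instead of starting a migration again.
            return self.state_docs[migration_id]
        if tenant_id in self.mtabs:
            raise ConflictingOperationInProgress(self.mtabs[tenant_id])
        doc = StateDoc(migration_id, tenant_id)
        self.state_docs[migration_id] = doc
        self.mtabs[tenant_id] = migration_id
        return doc

    def abort(self, migration_id):
        self.state_docs[migration_id].state = "aborted"

    def mark_garbage_collectable(self, migration_id):
        doc = self.state_docs[migration_id]
        doc.garbage_collectable = True
        if doc.state == "aborted":
            # Step 1: drop the mtab entry at marking time.
            self.mtabs.pop(doc.tenant_id, None)

    def recover_mtabs(self):
        """Rebuild in-memory entries, as after startup or rollback."""
        self.mtabs.clear()
        for doc in self.state_docs.values():
            if doc.state == "aborted" and doc.garbage_collectable:
                continue  # Step 2: no entry for aborted, collectable docs.
            self.mtabs[doc.tenant_id] = doc.migration_id

    def on_delete(self, migration_id):
        doc = self.state_docs.pop(migration_id)
        # Step 3: nothing to remove for an aborted doc; its entry is already
        # gone, and the tenant's current entry may belong to a newer migration.
        if doc.state != "aborted":
            # Assumption: other docs still clean up their entry here.
            self.mtabs.pop(doc.tenant_id, None)
```

In this model, a new migration for the same tenant can start as soon as the aborted one is marked garbage collectable, and neither a later recover_mtabs nor the eventual on_delete of the old document disturbs the new migration's entry.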



 Comments   
Comment by Githook User [ 10/Dec/20 ]

Author:

{'name': 'Vishnu Kaushik', 'email': 'vishnu.kaushik@mongodb.com', 'username': 'kauboy26'}

Message: SERVER-53070 Allow back to back tenant migrations for retries
Branch: master
https://github.com/mongodb/mongo/commit/3ade1b6edcf2c4c21c0ad0fccfeb6f292baa9d1a

Comment by Vishnu Kaushik [ 08/Dec/20 ]

Just thought the final approach that was taken should be documented -
In the end, we decided to remove the TenantMigrationAccessBlocker entry when we mark an aborted migration document as garbage collectable. I guess roughly put, the property we are maintaining is "aborted garbage collectable documents do not have a TenantMigrationAccessBlocker entry". To maintain this property, we need to:
1. remove the mtab entry when we mark an aborted document as garbage collectable.
2. avoid creating the mtab entry for aborted garbage collectable documents when recovering mtabs.
3. remove the op observer onDelete code that deletes the mtab entry since this entry will have already been deleted.

Comment by Lingzhi Deng [ 30/Nov/20 ]

vishnu.kaushik will implement the second option to immediately delete the old state doc if the donorStartMigration encounters a conflicting state doc that is marked as garbage collectable.
