In general, tenant migration data is marked as garbage collectable after donorForgetMigration. It is then either deleted automatically by the TTLMonitor once it expires, or deleted by the next tenant migration request for the same tenant.
However, there is a corner case in which the TTLMonitor thread and the TenantMigrationRecipientService thread try to delete the same tenant migration data concurrently. For example:
Time 1: [TenantMigrationRecipientService thread] checks for and gets the existing mtab of the migration.
Time 2: [TTLMonitor thread] deletes the mtab because it has expired.
Time 3: [TenantMigrationRecipientService thread] tries to delete the mtab. Today, a ConflictingOperationInProgress error is returned. This is not expected. Here is the code to be investigated.
When checking whether there is another ongoing migration, we try to delete the existing state document only if it was marked for garbage collection, so that the new migration can proceed. However, the current logic cannot tell the following two situations apart:
- If the state document doesn't have the "expireAt" field, we return false since nDeleted will be 0.
- If the state document is deleted by another thread while we try to delete it, nDeleted will be 0 as well. This is wrong: if the state document is being deleted at the same time as we try to delete it, we should proceed with the migration, since the outcome is the same as if we had deleted it ourselves.
We do not have a way to differentiate the second case from the first one, and therefore we throw "ConflictingOperationInProgress". This ticket is to improve the code to handle that potential race condition.
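One possible way to tell the two cases apart is to re-check whether the state document still exists whenever the conditional delete reports nDeleted == 0. The sketch below is only an illustration in Python, not the server's C++ code: the dict-backed `collection`, the helper names, and the `on_before_delete` hook (used to replay the Time 2 step deterministically) are all hypothetical.

```python
from datetime import datetime, timezone


class ConflictingOperationInProgress(Exception):
    """Stand-in for the server error of the same name."""


def delete_if_garbage_collectable(collection, migration_id):
    """Delete the state document only if it has "expireAt"; return nDeleted."""
    doc = collection.get(migration_id)
    if doc is not None and "expireAt" in doc:
        del collection[migration_id]
        return 1
    return 0


def clear_existing_migration(collection, migration_id, on_before_delete=None):
    """Return an outcome string if the new migration may proceed, else raise."""
    # Time 1: check whether a previous migration left a state document.
    if migration_id not in collection:
        return "no_existing_document"

    # Hypothetical hook: lets a test simulate the TTLMonitor acting at Time 2.
    if on_before_delete is not None:
        on_before_delete()

    # Time 3: conditional delete (only documents marked for garbage collection).
    if delete_if_garbage_collectable(collection, migration_id) == 1:
        return "deleted"

    # nDeleted == 0 is ambiguous, so look at the document again.
    if migration_id not in collection:
        # Someone else (e.g. the TTLMonitor) removed it: same result as our delete.
        return "deleted_concurrently"

    # The document still exists and is not garbage collectable: a real conflict.
    raise ConflictingOperationInProgress(migration_id)


if __name__ == "__main__":
    expired = {"expireAt": datetime(2020, 1, 1, tzinfo=timezone.utc)}
    store = {"tenantA": dict(expired)}
    # Replay the race: the "TTLMonitor" removes the document before our delete.
    print(clear_existing_migration(
        store, "tenantA", on_before_delete=lambda: store.pop("tenantA")))
```

In this sketch, only a document that is still present and lacks "expireAt" after the failed delete leads to ConflictingOperationInProgress. The re-check is not atomic with the delete, so a real fix would need to consider how the server's storage and locking rules affect that window.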
- related to SERVER-70773 Skip rebuilding instance on stepup in tenant migration recipient test (Closed)