[SERVER-57491] Do not recreate recipient mtab if recipientForgetMigration is received after the state doc is deleted Created: 07/Jun/21 Updated: 29/Oct/23 Resolved: 12/Jun/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 5.0.0-rc2, 5.1.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Pavithra Vetriselvan | Assignee: | Lingzhi Deng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | pm-1791_non-cloud-blocking, post-rc0 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v5.0 |
| Sprint: | Repl 2021-06-14 |
| Participants: | |
| Description |
|
If a migration commits but isn't yet forgotten/garbage collected, a subsequent read on the recipient can fail with SnapshotTooOld. |
| Comments |
| Comment by Vivian Ge (Inactive) [ 06/Oct/21 ] |
|
Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you! |
| Comment by Githook User [ 13/Jun/21 ] |
|
Author: {'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'} Message: (cherry picked from commit 2c316c7197b5dd8885c91f4ff27d9327e986db7c) |
| Comment by Githook User [ 11/Jun/21 ] |
|
Author: {'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'} Message: |
| Comment by Lingzhi Deng [ 08/Jun/21 ] |
|
I think the issue is that if a recipientForgetMigration command is received after the recipient state doc is deleted, the recipient re-initializes the state doc and immediately marks it garbage collectable again. This design accounts for delayed recipientSyncData commands, so that the recipient doesn't mistakenly restart a migration that has already been forgotten. But in the passthroughs we use a very small tenantMigrationGarbageCollectionDelayMS value, so the recipient state doc is usually deleted "immediately" after it is marked garbage collectable. If the donor then retries forgetting the migration due to a failover, the recipient could re-initialize the state doc in kStarted state. If the state doc is updated again while it is in kStarted state, that momentarily creates an access blocker, which lasts until the migration is forgotten again. So I think a potential fix would be to initialize the state doc directly in kDone state if a recipientForgetMigration command is received after the recipient state doc is deleted. |
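The sequence described in this comment can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the server's implementation: `ToyRecipient`, `State`, `init_in_done`, and `blockers_recreated_on_retry` are hypothetical names, and the rule "an update seen while the doc is in kStarted creates an access blocker" is a simplification of the behavior described above.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    STARTED = "kStarted"
    DONE = "kDone"


@dataclass
class ToyRecipient:
    """Hypothetical stand-in for the recipient; not the server's real classes."""
    init_in_done: bool                               # True models the proposed fix
    state_docs: dict = field(default_factory=dict)   # migration id -> State
    blockers: set = field(default_factory=set)       # ids that have an access blocker
    blockers_created: int = 0                        # blockers created by state doc updates

    def _update_state_doc(self, migration_id, new_state):
        # Assumption: an update observed while the doc is in kStarted
        # creates an access blocker if none exists yet.
        if self.state_docs[migration_id] is State.STARTED:
            if migration_id not in self.blockers:
                self.blockers.add(migration_id)
                self.blockers_created += 1
        self.state_docs[migration_id] = new_state

    def forget_migration(self, migration_id):
        if migration_id not in self.state_docs:
            # State doc was already garbage collected: re-initialize it.
            self.state_docs[migration_id] = (
                State.DONE if self.init_in_done else State.STARTED
            )
        # Mark the doc garbage collectable and drop any blocker.
        self._update_state_doc(migration_id, State.DONE)
        self.blockers.discard(migration_id)

    def garbage_collect(self, migration_id):
        # A tiny GC delay means this runs almost right after the forget.
        self.state_docs.pop(migration_id, None)


def blockers_recreated_on_retry(init_in_done):
    r = ToyRecipient(init_in_done)
    r.state_docs["m1"] = State.STARTED   # migration in progress, blocker present
    r.blockers.add("m1")
    r.forget_migration("m1")             # first forget succeeds
    r.garbage_collect("m1")              # state doc deleted "immediately"
    r.forget_migration("m1")             # donor retries after a failover
    return r.blockers_created
```

In this model, `blockers_recreated_on_retry(False)` counts one re-created blocker on the retried forget, while `blockers_recreated_on_retry(True)`, which initializes the doc directly in kDone, counts none.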
| Comment by Pavithra Vetriselvan [ 07/Jun/21 ] |
|
Based on the logs, it looks like the forgetMigration command was never able to succeed since we keep killing the donor and recipient primaries. Should our test infrastructure wait to re-route commands until the migration is forgotten instead of just after a DBHash check? |