[SERVER-57491] Do not recreate recipient mtab if recipientForgetMigration is received after the state doc is deleted
Created: 07/Jun/21  Updated: 29/Oct/23  Resolved: 12/Jun/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.0.0-rc2, 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Pavithra Vetriselvan Assignee: Lingzhi Deng
Resolution: Fixed Votes: 0
Labels: pm-1791_non-cloud-blocking, post-rc0
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-52713 [testing] Add stepdown/kill/terminate... Closed
is depended on by SERVER-57261 Enable tenant migration failover pass... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0
Sprint: Repl 2021-06-14
Participants:

 Description   

If a migration commits but isn't yet forgotten/garbage collected, a subsequent read on the recipient can fail with SnapshotTooOld.



 Comments   
Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 13/Jun/21 ]

Author:

{'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}

Message: SERVER-57491: Do not recreate recipient mtab if recipientForgetMigration is received after the state doc is deleted

(cherry picked from commit 2c316c7197b5dd8885c91f4ff27d9327e986db7c)
Branch: v5.0
https://github.com/mongodb/mongo/commit/1a260dc0ebb4fc989d4613f0eb4e746f80906c35

Comment by Githook User [ 11/Jun/21 ]

Author:

{'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}

Message: SERVER-57491: Do not recreate recipient mtab if recipientForgetMigration is received after the state doc is deleted
Branch: master
https://github.com/mongodb/mongo/commit/2c316c7197b5dd8885c91f4ff27d9327e986db7c

Comment by Lingzhi Deng [ 08/Jun/21 ]

I think the issue is that if a recipientForgetMigration command is received after the recipient state doc has been deleted, the recipient re-initializes the state doc and immediately marks it garbage collectable again. This design accounts for delayed recipientSyncData commands, so that the recipient doesn't mistakenly restart a migration that has already been forgotten. But in the passthroughs we use a very small tenantMigrationGarbageCollectionDelayMS value, so the recipient state doc is usually deleted "immediately" after it is marked garbage collectable. If the donor then retries forgetting the migration because of a failover, the recipient can re-initialize the state doc in kStarted state. If the state doc is updated again while it is still in kStarted, this momentarily creates an access blocker, which exists until the migration is forgotten again. So I think a potential fix would be to initialize the state doc directly in kDone state when a recipientForgetMigration command is received after the recipient state doc has been deleted.
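
To make the proposal in the comment above concrete, here is a minimal toy model in Python. It is not the server code: the `Recipient` class, its fields, and the `init_in_done` flag are all hypothetical names invented for illustration. It also simplifies the trigger by assuming that an access blocker appears as soon as a state doc is written in kStarted, whereas the comment describes it appearing on a later update of the doc.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    STARTED = "kStarted"
    DONE = "kDone"


@dataclass
class Recipient:
    """Toy stand-in for a recipient node (hypothetical, not the real implementation)."""
    state_docs: dict = field(default_factory=dict)     # migration id -> State
    access_blockers: set = field(default_factory=set)  # migration ids that have a blocker

    def _insert_state_doc(self, migration_id, state):
        self.state_docs[migration_id] = state
        # Simplifying assumption: a doc written in kStarted yields an access blocker.
        if state is State.STARTED:
            self.access_blockers.add(migration_id)

    def forget_migration(self, migration_id, init_in_done):
        """Handle a forget request; return True if a blocker was transiently created."""
        created_blocker = False
        if migration_id not in self.state_docs:
            # The state doc was already garbage collected. It is re-created so that
            # a delayed sync request cannot restart the forgotten migration.
            initial = State.DONE if init_in_done else State.STARTED
            self._insert_state_doc(migration_id, initial)
            created_blocker = migration_id in self.access_blockers
        # Mark the migration as forgotten and drop any blocker.
        self.state_docs[migration_id] = State.DONE
        self.access_blockers.discard(migration_id)
        return created_blocker

    def garbage_collect(self, migration_id):
        self.state_docs.pop(migration_id, None)


# A retried forget arrives after the state doc is gone.
print(Recipient().forget_migration("m1", init_in_done=False))  # True: blocker re-created
print(Recipient().forget_migration("m1", init_in_done=True))   # False: no blocker
```

In both variants the re-created doc ends up in kDone, so the guard against delayed recipientSyncData commands is preserved; the only difference is whether the doc ever passes through kStarted.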

Comment by Pavithra Vetriselvan [ 07/Jun/21 ]

Task
Logs

Based on the logs, it looks like the forgetMigration command was never able to succeed since we keep killing the donor and recipient primaries. Should our test infrastructure wait to re-route commands until the migration is forgotten instead of just after a DBHash check?
