[SERVER-50890] Failure to persist migration coordinator document leads to hung migration Created: 11/Sep/20  Updated: 29/Oct/23  Resolved: 08/Jan/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Bug Priority: Major - P3
Reporter: Jack Mulrow Assignee: Marcos José Grillo Ramirez
Resolution: Fixed Votes: 0
Labels: PM-1645-Milestone-1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-51472 Assertion during early stages of migr... Closed
Problem/Incident
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2020-10-19, Sharding 2020-11-02, Sharding 2020-11-16, Sharding 2020-11-30, Sharding 2020-12-14, Sharding 2020-12-28, Sharding 2021-01-11, Sharding 2021-01-25
Participants:
Linked BF Score: 24

 Description   

When starting the cloning phase of a migration, the donor shard inserts a document into the config.migrationCoordinators collection. If the insert fails and throws an exception, the MigrationSourceManager::cleanupOnError() scope guard runs and tries to complete the migration by persisting an abort decision. It does so through an update to the document that failed to be inserted, so the update fails because there is no matching document. Persisting the decision retries on errors until a stepdown or shutdown, so until one of those happens the migration hangs trying to update the non-existent document.
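The failure mode described above can be illustrated with a small, self-contained simulation. This is only a sketch and not the server's C++ code: FakeCollection, persist_abort_decision, run_migration, and the stepdown_after attempt limit are all hypothetical names, and the attempt limit merely stands in for a stepdown or shutdown interrupting the retry loop.

```python
class StepDown(Exception):
    """Hypothetical stand-in for a stepdown or shutdown interrupting retries."""


class FakeCollection:
    """In-memory stand-in for config.migrationCoordinators (assumption)."""

    def __init__(self):
        self.docs = {}

    def insert(self, doc_id, doc, fail=False):
        if fail:
            raise RuntimeError("insert failed")
        self.docs[doc_id] = doc

    def update(self, doc_id, fields):
        # An update that matches no document is treated as an error here.
        if doc_id not in self.docs:
            raise LookupError("no matching document")
        self.docs[doc_id].update(fields)


def persist_abort_decision(coll, doc_id, stepdown_after):
    """Retry the update on error until it succeeds or a simulated stepdown."""
    attempts = 0
    while True:
        if attempts >= stepdown_after:
            raise StepDown(f"still retrying after {attempts} attempts")
        attempts += 1
        try:
            coll.update(doc_id, {"decision": "aborted"})
            return attempts
        except LookupError:
            continue  # retried indefinitely in the scenario from the ticket


def run_migration(coll, doc_id, insert_fails, stepdown_after=5):
    """Insert the coordinator document; on failure, run the cleanup path."""
    try:
        coll.insert(doc_id, {"decision": None}, fail=insert_fails)
    except RuntimeError:
        # Cleanup tries to persist an abort decision for a document
        # that was never inserted, so every update attempt fails.
        return persist_abort_decision(coll, doc_id, stepdown_after)
    return 0
```

In this sketch, a failed insert makes run_migration spin in the retry loop until the simulated stepdown, while a successful insert lets a later persist_abort_decision call succeed on its first attempt.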

UPDATE: Fixing this issue exposed another problem. If a migration coordinator document is left without a decision (for example, because the document insert failed to satisfy the majority write concern), then another migration on the same session that increases the transaction number causes the txnNumber bump to fail during the next recovery with a TransactionTooOld error, as can be seen in the linked BF.
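A rough model of this second problem, under stated assumptions: a session rejects any transaction number lower than the highest one it has seen, and recovery of a leftover coordinator document tries to bump the session to that document's txnNumber plus one. SessionState, advance, and recover_stale_coordinator are hypothetical names, and the "plus one" bump is an assumption made for illustration, not a detail confirmed by the ticket.

```python
class TransactionTooOld(Exception):
    """Models the TransactionTooOld error named in the ticket."""


class SessionState:
    """Tracks the highest txnNumber seen on one session (hypothetical)."""

    def __init__(self):
        self.last_txn_number = -1

    def advance(self, txn_number):
        # A number lower than the highest one already seen is rejected.
        if txn_number < self.last_txn_number:
            raise TransactionTooOld(
                f"txnNumber {txn_number} is older than {self.last_txn_number}"
            )
        self.last_txn_number = txn_number


def recover_stale_coordinator(session, doc_txn_number):
    """Bump the session past the txnNumber stored in a leftover document.

    The +1 bump is an assumption made for this sketch.
    """
    session.advance(doc_txn_number + 1)


session = SessionState()
session.advance(5)  # first migration; its document is left without a decision
session.advance(7)  # later migration on the same session, higher txnNumber

try:
    recover_stale_coordinator(session, 5)  # attempts to bump to 6, below 7
    outcome = "recovered"
except TransactionTooOld:
    outcome = "TransactionTooOld"
```

Under these assumptions, the recovery of the first migration's document fails because the second migration has already moved the session past the number recovery wants to use.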



 Comments   
Comment by Githook User [ 08/Jan/21 ]

Author: Marcos José Grillo Ramírez <marcos.grillo@mongodb.com> (m4nti5)

Message: SERVER-50890 Improve failover management when completing migrations
Branch: master
https://github.com/mongodb/mongo/commit/40aa110c655b6a3b562881c63d14a83c0848b3a0

Comment by Jack Mulrow [ 23/Oct/20 ]

matthew.saltz, you're right. I changed it to donor shard.

Comment by Matthew Saltz (Inactive) [ 23/Oct/20 ]

Should that first sentence say "the donor shard"?
