[SERVER-62213] Investigate presence of multiple migration coordinator documents Created: 21/Dec/21 Updated: 21/Mar/22 Resolved: 23/Dec/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 5.0.5, 5.1.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Pierlauro Sciarelli | Assignee: | Pierlauro Sciarelli |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Sprint: | Sharding EMEA 2021-12-27 |
| Participants: |
| Case: | (copied to CRM) |
| Description |
|
It has been observed on a cluster that one shard had 4 migration coordinator documents, which led to hitting this invariant on step-up. The documents all referred to migrations for different namespaces, and their states were:
The range deletions appear to have been correctly handled on both the donor and the recipients:
Given that the migrations had reached a "decided" state, we can consider that:
It is therefore very likely that something odd happened right afterwards, as part of the call to forgetMigration, which did not remove the migration coordinator documents. |
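As a diagnostic aid, the leftover documents can be inspected directly on the primary of the affected shard. The sketch below is a minimal pymongo example with a placeholder connection string; it reads the config.migrationCoordinators and config.rangeDeletions collections that the sharding subsystem uses for migration recovery and range deletion tasks.

```python
from pymongo import MongoClient

# Placeholder URI: connect directly to the primary of the affected shard.
client = MongoClient("mongodb://shard-primary.example.net:27017")
config_db = client["config"]

# Migration coordinator documents that should have been removed at the end
# of their respective migrations.
for doc in config_db["migrationCoordinators"].find():
    print("leaked coordinator:", doc)

# Range deletion tasks, to cross-check whether the ranges themselves were
# handled on the donor.
for task in config_db["rangeDeletions"].find():
    print("pending range deletion:", task)
```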
| Comments |
| Comment by Pierlauro Sciarelli [ 23/Dec/21 ] |
|
jordi.serra-torrens correctly pointed out that the failure may come from waiting for the vector clock's config time to be majority committed as part of deleteMigrationCoordinatorDocumentLocally. This explains why the delete of the migration coordinator was never served: it was never reached. |
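To make the ordering concrete, here is a simplified Python sketch of the behaviour described above (not the server's actual C++ code; the helper names are hypothetical): the local delete of the coordinator document is only reached after the majority wait on the config time succeeds, so any exception thrown by that wait leaves the document behind.

```python
class MajorityWaitTimeout(Exception):
    """Stand-in for the failure of the majority wait (hypothetical)."""


def wait_for_majority_commit_of_config_time(lagging: bool) -> None:
    # Stand-in for the real wait: under replication pressure it may not complete.
    if lagging:
        raise MajorityWaitTimeout("waiting for replication timed out")


def delete_migration_coordinator_document_locally(coordinators: dict,
                                                  migration_id: str,
                                                  lagging: bool) -> None:
    # 1. Wait for the vector clock's config time to be majority committed.
    #    If this raises, the step below is never executed.
    wait_for_majority_commit_of_config_time(lagging)
    # 2. Only then is the coordinator document removed locally.
    coordinators.pop(migration_id, None)


# The wait fails under pressure, so the coordinator document is leaked.
docs = {"migration-1": {"decision": "committed"}}
try:
    delete_migration_coordinator_document_locally(docs, "migration-1", lagging=True)
except MajorityWaitTimeout:
    pass
print(docs)  # the coordinator document is still there
```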
| Comment by Pierlauro Sciarelli [ 23/Dec/21 ] |
|
The leaked coordinators were not deleted after delivering a decision because of a WriteConcernFailed ("waiting for replication timed out") exception. My interpretation is that, since the node was under heavy pressure, it was probably not possible to commit the deletion locally right away, which caused this very restrictive write concern (local commit with a timeout of 0 seconds) to fail. This is odd because, assuming the persistent task store's remove honours the wtimeout documentation, setting a timeout of 0 seconds should mean no timeout at all. While investigating this further, the logic associated with the failure of a migration also appears far too optimistic:
It basically assumes that:
Since it has been observed that other exceptions can break those assumptions, it would be reasonable to enrich the catch body to check the kind of error and react accordingly (e.g. if the exception is not due to a stepdown, handle the scenario differently). |
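A minimal sketch of the enrichment proposed above, assuming hypothetical error categories (the class names below are placeholders, not the server's actual error codes): only stepdown/shutdown errors may rely on the assumption that a new primary will recover the coordinator, while any other failure is handled explicitly.

```python
class StepdownOrShutdownError(Exception):
    """Placeholder for stepdown/shutdown error categories."""


class WriteConcernTimeout(Exception):
    """Placeholder for a write-concern failure such as the one observed here."""


def handle_migration_failure(exc: Exception) -> str:
    # Proposed shape of the enriched catch body: do not treat every failure
    # as if a step-up on another node will clean up the coordinator.
    if isinstance(exc, StepdownOrShutdownError):
        return "rely on the new primary to recover the coordinator"
    # Any other error (e.g. a write concern timeout) needs explicit handling,
    # such as retrying the coordinator cleanup.
    return f"handle explicitly: {exc}"


print(handle_migration_failure(WriteConcernTimeout("waiting for replication timed out")))
```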