[SERVER-62213] Investigate presence of multiple migration coordinator documents Created: 21/Dec/21  Updated: 21/Mar/22  Resolved: 23/Dec/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 5.0.5, 5.1.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Pierlauro Sciarelli Assignee: Pierlauro Sciarelli
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Problem/Incident
Related
related to SERVER-62245 MigrationRecovery must not assume tha... Closed
related to SERVER-62243 Wait for vector clock document majori... Closed
Sprint: Sharding EMEA 2021-12-27
Participants:
Case:

 Description   

The presence of 4 migration coordinator documents has been observed on one shard of a cluster, leading the shard to hit this invariant on step-up.

The documents all referred to migrations of different namespaces, and their states were:

  • 2 aborted
  • 1 committed
  • 1 without decision

The range deletions seemed to have been handled correctly on both the donor and the recipients:

  • No range deletion documents for the aborted migrations (range deletion tasks already executed)
  • Ready range deletion task on the donor for the committed migration
  • Pending range deletions on donor/recipient for the migration without a decision

Given the state of the "decided" migrations, the decision had clearly already been delivered. It is then very likely that something odd happened right afterwards, as part of the call to forgetMigration, which did not remove the migration coordinator documents.
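To make the suspected failure mode concrete, here is a minimal, self-contained sketch of the tail end of a migration: the decision is delivered to the participants, and only afterwards is the coordinator document removed. The function names (deliverDecisionToParticipants, deleteCoordinatorDoc) are hypothetical stand-ins, not the real server code; the point is only that a failure between the two steps leaves exactly the observed state, a handled decision plus a leaked coordinator document.

{code:cpp}
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the real steps of completing a migration.
enum class Decision { kCommitted, kAborted };

void deliverDecisionToParticipants(const std::string& ns, Decision d) {
    // In the real server this step drives the range deletion tasks on the
    // donor/recipient according to the decision.
    std::cout << "decision " << (d == Decision::kCommitted ? "commit" : "abort")
              << " delivered for " << ns << "\n";
}

void deleteCoordinatorDoc(const std::string& ns, bool simulateFailure) {
    // Stand-in for removing the entry from config.migrationCoordinators.
    if (simulateFailure) {
        throw std::runtime_error(
            "WriteConcernFailed: waiting for replication timed out");
    }
    std::cout << "coordinator document removed for " << ns << "\n";
}

int main() {
    try {
        deliverDecisionToParticipants("db.coll", Decision::kCommitted);
        // If this second step throws, the decision has already been delivered
        // but the coordinator document survives, which matches the observed
        // state: range deletions handled, coordinator documents leaked.
        deleteCoordinatorDoc("db.coll", /*simulateFailure=*/true);
    } catch (const std::exception& e) {
        std::cout << "cleanup interrupted: " << e.what() << "\n";
    }
}
{code}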



 Comments   
Comment by Pierlauro Sciarelli [ 23/Dec/21 ]

jordi.serra-torrens correctly pointed out that the failure may come from the wait for majority commit of the vector clock's config time performed as part of deleteMigrationCoordinatorDocumentLocally. This explains why the delete of the migration coordinator was never served: it was simply never reached.
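A minimal sketch of that ordering, assuming (as described above) that the majority-commit wait on the config time happens before the local remove inside deleteMigrationCoordinatorDocumentLocally; the helper names below are hypothetical, not the server's actual functions:

{code:cpp}
#include <iostream>
#include <stdexcept>

// Hypothetical stand-ins; the real entry point is the server's
// deleteMigrationCoordinatorDocumentLocally.
void waitForMajorityCommitOfConfigTime() {
    // If the node is struggling to replicate, this wait can fail first.
    throw std::runtime_error("WriteConcernFailed");
}

void removeCoordinatorDocLocally() {
    std::cout << "coordinator document deleted\n";
}

void deleteMigrationCoordinatorDocumentLocally() {
    // The wait runs first, so a failure here means the remove below is never
    // even attempted: the delete is "not served because it was not reached".
    waitForMajorityCommitOfConfigTime();
    removeCoordinatorDocLocally();
}

int main() {
    try {
        deleteMigrationCoordinatorDocumentLocally();
    } catch (const std::exception& e) {
        std::cout << "delete never reached: " << e.what() << "\n";
    }
}
{code}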

Comment by Pierlauro Sciarelli [ 23/Dec/21 ]

The leaked coordinators were not deleted after delivering a decision because of a WriteConcernFailed ("waiting for replication timed out") exception: my interpretation is that, since the node was under a lot of pressure, it was probably not possible to instantly commit the deletion locally, and that caused this very restrictive write concern (local commit with a timeout of 0 seconds) to fail. This is odd because, assuming the persistent task store's remove honours the wtimeout documentation, a timeout of 0 seconds should mean no timeout at all.
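To illustrate the ambiguity, here is a small self-contained sketch contrasting the two possible readings of wtimeout == 0; the WriteConcern struct below is a hypothetical model for this ticket, not the server's WriteConcernOptions:

{code:cpp}
#include <chrono>
#include <iostream>

// Hypothetical model of the two readings of wtimeout == 0 contrasted above.
struct WriteConcern {
    int wtimeoutMillis;  // 0 is the ambiguous value
};

bool waitSatisfied(const WriteConcern& wc, std::chrono::milliseconds replicationLag,
                   bool treatZeroAsNoTimeout) {
    if (wc.wtimeoutMillis == 0) {
        // Documented reading: never time out, so the wait eventually succeeds.
        // Observed reading: a zero-length deadline, so any lag at all fails.
        return treatZeroAsNoTimeout ? true : (replicationLag.count() == 0);
    }
    return replicationLag.count() <= wc.wtimeoutMillis;
}

int main() {
    WriteConcern wc{/*wtimeoutMillis=*/0};
    std::chrono::milliseconds lag{250};  // node under pressure, slightly behind

    std::cout << "documented semantics (no timeout): "
              << waitSatisfied(wc, lag, /*treatZeroAsNoTimeout=*/true) << "\n";   // 1
    std::cout << "observed semantics (0s deadline):  "
              << waitSatisfied(wc, lag, /*treatZeroAsNoTimeout=*/false) << "\n";  // 0
}
{code}

Under the documented semantics the remove would simply wait until the local commit completes; under the observed semantics any replication pressure at all is enough to fail the remove and leak the coordinator document.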

While investigating this problem further, the logic associated with the failure of a migration also appears overly optimistic:

  • Log a warning
  • Clear the filtering metadata for the migration's namespace

It basically assumes that:

  • The exception was due to the primary stepping down
  • The migration will be resumed at the next step-up

Since it has been observed that other exceptions can break those assumptions, it would be reasonable to enrich the catch body to check the kind of error and react accordingly (e.g. if the exception is not due to a stepdown, handle the scenario differently).
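A sketch of what such an enriched catch body could look like, with a hypothetical MigrationError type standing in for the server's exception/error-code machinery; the only point is to branch on whether the failure was actually a stepdown before relying on step-up recovery:

{code:cpp}
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical error type and classifier standing in for the server's
// exception hierarchy and "is this a stepdown error?" checks.
struct MigrationError : std::runtime_error {
    bool dueToStepdown;
    MigrationError(const std::string& what, bool stepdown)
        : std::runtime_error(what), dueToStepdown(stepdown) {}
};

void clearFilteringMetadata(const std::string& ns) {
    std::cout << "filtering metadata cleared for " << ns << "\n";
}

void runMigration(const std::string& ns) {
    // Simulate a failure that is *not* caused by the primary stepping down,
    // e.g. the WriteConcernFailed error discussed above.
    throw MigrationError("WriteConcernFailed while completing migration of " + ns,
                         /*stepdown=*/false);
}

int main() {
    const std::string ns = "db.coll";
    try {
        runMigration(ns);
    } catch (const MigrationError& e) {
        std::cout << "warning: migration failed: " << e.what() << "\n";
        clearFilteringMetadata(ns);
        if (!e.dueToStepdown) {
            // The current code stops here and counts on step-up recovery,
            // which never runs if the node does not actually step down.
            // Reacting to the specific error (e.g. retrying the cleanup or
            // failing loudly) avoids leaking the coordinator document.
            std::cout << "non-stepdown failure: trigger explicit recovery\n";
        }
    }
}
{code}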
