[SERVER-62245] MigrationRecovery must not assume that only one migration needs to be recovered Created: 23/Dec/21  Updated: 29/Oct/23  Resolved: 30/Dec/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 5.0.0, 5.2.0, 5.1.0
Fix Version/s: 5.3.0, 5.1.2, 5.0.6, 5.2.0-rc4

Type: Bug Priority: Critical - P2
Reporter: Jordi Serra Torrens Assignee: Jordi Serra Torrens
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File repro-62245.patch    
Issue Links:
Backports
Problem/Incident
is caused by SERVER-50174 Multiple concurrent migration recover... Closed
Related
related to SERVER-62296 MoveChunk should recover any unfinish... Closed
is related to SERVER-60521 Deadlock on stepup due to moveChunk c... Closed
is related to SERVER-62213 Investigate presence of multiple migr... Closed
is related to SERVER-62243 Wait for vector clock document majori... Closed
is related to SERVER-62316 Remove the workaround for SERVER-6224... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.2, v5.1, v5.0
Steps To Reproduce:

repro-62245.patch

./buildscripts/resmoke.py run --storageEngine=wiredTiger --storageEngineCacheSizeGB=.50 --suite=sharding  jstests/sharding/recover_multiple_migrations_on_stepup.js --log=file

Sprint: Sharding EMEA 2021-12-27, Sharding EMEA 2022-01-10
Participants:
Case:

 Description   
Issue and status as of Dec 30, 2021

ISSUE DESCRIPTION AND IMPACT

This issue can cause unavailability of a shard in sharded clusters running MongoDB versions 5.0.0 - 5.0.5 and 5.1.0 - 5.1.1. Later versions are not affected.

The problem can potentially occur if all of the following conditions have been met at least once:

  • The cluster has more than one sharded collection
  • Multiple chunk migrations have run
  • The shard has experienced intense write workloads or hardware failures

The symptom of the bug is a mongod process crashing upon step-up due to an invariant failure with the following message: "Upon step-up a second migration coordinator was found".

REMEDIATION AND WORKAROUNDS

  • Restart the nodes of the shard as a replica set
  • Double-check that at most one migration coordinator document lacks a definitive decision (see the shell sketch after this list).
  • For each migration coordinator document with a definitive decision, double-check that the range deletion tasks are consistent with the migration coordinator (same range and collectionUUID, if present):
    • Aborted decision:
      — No range deletion document on the donor
      — Zero or one ready range deletion document on the recipient
    • Committed decision:
      — Zero or one ready range deletion document on the donor
      — No range deletion document on the recipient
    • No decision:
      — One pending range deletion task on the donor
      — One pending range deletion task on the recipient
  • Majority-delete all migration coordinators with a definitive decision
  • Restart the nodes as a shard
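
A minimal mongo shell sketch of the inspection and cleanup steps above, run against the primary while the shard's nodes are running as a plain replica set. It assumes the decision field is absent from a migration coordinator document until a definitive decision has been recorded; field names can vary between server versions, so treat this as a starting point rather than a definitive procedure.

// Sketch only: assumes `db` is connected to the primary of the shard's replica set.
const configDB = db.getSiblingDB("config");

// 1. At most one migration coordinator document may lack a definitive decision.
const undecided = configDB.migrationCoordinators.find({decision: {$exists: false}}).toArray();
assert.lte(undecided.length, 1, tojson(undecided));

// 2. Print decided coordinators and range deletion tasks for the manual cross-check
//    described above (same range and collectionUUID, if present).
configDB.migrationCoordinators.find({decision: {$exists: true}}).forEach(printjson);
configDB.rangeDeletions.find().forEach(printjson);

// 3. Once the cross-check passes, majority-delete every migration coordinator that
//    already has a definitive decision.
assert.commandWorked(configDB.runCommand({
    delete: "migrationCoordinators",
    deletes: [{q: {decision: {$exists: true}}, limit: 0}],
    writeConcern: {w: "majority"}
}));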

TECHNICAL DETAILS

Migration coordinators:

  • Documents persisted locally on shards in the internal collection config.migrationCoordinators 
  • The structure of migration coordinator documents can be found here.

Range deletion tasks:

  • Documents persisted locally on shards in the internal collection config.rangeDeletions
  • The structure of range deletion task documents can be found here.
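
The linked schema definitions are not reproduced in this export, so the shapes below are only an approximate, hypothetical illustration (shell syntax). Field names and optional fields are assumptions that can differ between server versions; the links above remain the authoritative reference.

// Approximate, illustrative document shapes only.

// config.migrationCoordinators: one document per migration whose outcome may still
// need to be recovered.
const exampleCoordinator = {
    _id: UUID(),                          // migration id
    lsid: {id: UUID()},                   // logical session used by the migration (see description)
    txnNumber: NumberLong(0),
    nss: "test.coll",
    collectionUuid: UUID(),
    donorShardId: "shard0",
    recipientShardId: "shard1",
    range: {min: {x: 0}, max: {x: 10}},
    decision: "committed"                 // absent until a definitive decision is reached
};

// config.rangeDeletions: one document per chunk range waiting to be deleted.
const exampleRangeDeletion = {
    _id: UUID(),
    nss: "test.coll",
    collectionUuid: UUID(),
    donorShardId: "shard0",
    range: {min: {x: 0}, max: {x: 10}},
    whenToClean: "delayed",               // or "now"
    pending: true                         // present while the owning migration is still undecided
};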

 


--- Original ticket description ---

There are several situations that can lead to more than one migration (for different collections) needing recovery on step-up. For example, when a migration fails here we only clear the collection's filtering metadata (so that the next access to the collection triggers the recovery) and then release the ActiveMigrationRegistry. At this point, nothing prevents a migration on a different collection from starting, so if the shard then steps down it has two migrations to recover.

This invariant, along with taking the MigrationBlockingGuard during step-up migration recovery, was added in SERVER-50174. It was meant to prevent migrations on different collections from starting before the unfinished migrations found on step-up are recovered. However, as described above, situations with multiple migrations pending recovery can still arise without any step-down involved.

The fact that a different migration (on another collection) starts using the same lsid as the migration pending recovery should not be a problem: the new migration will use a txnNumber that is two greater than the previous migration's. This is effectively the same as advancing the txn number, and it prevents the first migration from using its original (lsid, txnNumber) pair. However, the TransactionTooOld error that a recovering migration gets when advancing the txnNumber on the recipient is not fully safe to ignore, because TransactionTooOld does not guarantee that a rollback cannot occur, after which the original txnNumber could become valid again.
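
A minimal shell sketch of the (lsid, txnNumber) fencing behaviour described above, assuming a connection to a replica-set mongod (retryable writes are not available on standalones); the collection name is arbitrary:

// Sketch: reusing a logical session with a lower txnNumber than one already seen
// is rejected with TransactionTooOld. This is the same mechanism by which a new
// migration's higher txnNumber fences off the migration pending recovery.
const lsid = {id: UUID()};

// A retryable write reserves (lsid, txnNumber = 7).
assert.commandWorked(db.runCommand(
    {insert: "c", documents: [{x: 1}], lsid: lsid, txnNumber: NumberLong(7)}));

// Reusing the same session with an older txnNumber fails with TransactionTooOld,
// but (as noted above) this error alone does not rule out a later rollback.
assert.commandFailedWithCode(
    db.runCommand({insert: "c", documents: [{x: 2}], lsid: lsid, txnNumber: NumberLong(5)}),
    ErrorCodes.TransactionTooOld);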

This ticket will provide a fix so that clusters that are already in the faulty situation of having several migrations pending recovery no longer hit the invariant on step-up. SERVER-62296 will prevent this faulty situation from happening again.



 Comments   
Comment by Githook User [ 28/Jun/22 ]

Author:

{'name': 'Jordi Serra Torrens', 'email': 'jordi.serra-torrens@mongodb.com', 'username': 'jordist'}

Message: SERVER-62316 Remove the workaround for SERVER-62245
Branch: master
https://github.com/mongodb/mongo/commit/da17e726a6fedddeb525229f7afa93dbce7f94d6

Comment by Githook User [ 30/Dec/21 ]

Author:

{'name': 'Jordi Serra Torrens', 'email': 'jordi.serra-torrens@mongodb.com', 'username': 'jordist'}

Message: SERVER-62245 MigrationRecovery must not assume that only one migration needs to be recovered

(cherry picked from commit 8e6ab9a259d921298940190161fadfd118c6dc15)
Branch: v5.0
https://github.com/mongodb/mongo/commit/160cc06cd9dc4861ebe0678ed1a9286e21aef8ab

Comment by Githook User [ 30/Dec/21 ]

Author:

{'name': 'Jordi Serra Torrens', 'email': 'jordi.serra-torrens@mongodb.com', 'username': 'jordist'}

Message: SERVER-62245 MigrationRecovery must not assume that only one migration needs to be recovered

(cherry picked from commit 8e6ab9a259d921298940190161fadfd118c6dc15)
Branch: v5.1
https://github.com/mongodb/mongo/commit/8326031e5207e4f000ec81e1e51981e370edbab9

Comment by Githook User [ 30/Dec/21 ]

Author:

{'name': 'Jordi Serra Torrens', 'email': 'jordi.serra-torrens@mongodb.com', 'username': 'jordist'}

Message: SERVER-62245 MigrationRecovery must not assume that only one migration needs to be recovered

(cherry picked from commit 8e6ab9a259d921298940190161fadfd118c6dc15)
Branch: v5.2
https://github.com/mongodb/mongo/commit/5a4409cca94498a8e7810ffedf4a049053db2c46

Comment by Githook User [ 30/Dec/21 ]

Author:

{'name': 'Jordi Serra Torrens', 'email': 'jordi.serra-torrens@mongodb.com', 'username': 'jordist'}

Message: SERVER-62245 MigrationRecovery must not assume that only one migration needs to be recovered
Branch: master
https://github.com/mongodb/mongo/commit/8e6ab9a259d921298940190161fadfd118c6dc15

Comment by Tommaso Tocci [ 23/Dec/21 ]

This bug was introduced by SERVER-50174 and exacerbated by SERVER-49192.
