Type: Bug
Resolution: Gone away
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 4.4.0, 5.0.0, 5.1.0-rc0
Component/s: Sharding
Labels:
- sharding-wfbf-sprint
- shardingemea-qw

Operating System:
ALL
Steps To Reproduce:

0001-SERVER-60521-repro.patch

Sprint:
Sharding EMEA 2021-10-18, Sharding EMEA 2022-02-21
Linked BF Score:
0
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None


Consider the following interleaving:
1. A shard was running a moveChunk and had already persisted the migration recovery document. It then steps down, so the new primary will need to recover the migration.
2. In parallel, on that same node, another moveChunk had arrived while the node was still primary, but had not yet executed past this point.
3. The stepdown completes.
4. The second moveChunk continues and is able to register the migration (since the first migration has already unregistered from the ActiveMigrationRegistry). A new ThreadClient is created and marked as killable on stepdown. However, since the node has already transitioned to secondary, it won't actually get killed.
5. The old primary that just stepped down wins the election and becomes primary again.
6. During stepup, the primary sees that there was a migration ongoing (the one started in (1)), so it attempts to recover it. To do so, it needs to acquire the MigrationBlockingGuard while still in drain mode. However, since the migration started in (2) managed to register on the ActiveMigrationRegistry, the MigrationBlockingGuard cannot be acquired and the stepup waits.
7. On the other side, the migration from (2) cannot make progress because the stepup holds the global lock, so it will never release the ActiveMigrationRegistry.
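
The circular wait in steps (6) and (7) can be sketched as a small wait-for graph. This is only a toy model to make the cycle explicit: the actor and resource names (`stepup_recovery`, `second_moveChunk`, `global_lock`, `active_migration_registry`) are illustrative labels chosen for this sketch, not server identifiers, and `find_cycle` is a hypothetical helper.

```python
# Toy model of the interleaving: who holds what, and who waits for what.
# All names below are illustrative labels, not MongoDB server identifiers.

def find_cycle(holds, waits_for):
    """Return the list of actors forming a wait cycle, or None if there is none."""
    # Map each resource to the actor currently holding it.
    owner = {res: actor for actor, resources in holds.items() for res in resources}
    for start in waits_for:
        path = [start]
        current = start
        while current in waits_for:
            blocker = owner.get(waits_for[current])
            if blocker is None:
                break  # the awaited resource is free, so this chain ends
            if blocker in path:
                return path[path.index(blocker):]
            path.append(blocker)
            current = blocker
    return None

# State after step (5): the stepup holds the global lock, and the second
# moveChunk occupies the registry slot.
holds = {
    "stepup_recovery": {"global_lock"},
    "second_moveChunk": {"active_migration_registry"},
}
# Step (6): recovery waits for the registry. Step (7): the migration waits
# for the global lock.
waits_for = {
    "stepup_recovery": "active_migration_registry",
    "second_moveChunk": "global_lock",
}

cycle = find_cycle(holds, waits_for)
print(cycle)
```

If the second moveChunk had never registered (that is, it is removed from `holds`), the same function finds no cycle, which matches the observation that the problem only arises because the migration from (2) registers after the stepdown.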

To fix this we should make sure that moveChunk cannot run uninterrupted on a secondary.
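
One possible shape of that idea, as a hedged sketch only: re-check the replication state after registering the migration and bail out if the node is no longer primary, so the registry slot is released instead of being held on a secondary. The `ToyShard` class, its fields, and `NotPrimaryError` are assumptions made for this illustration. They do not reflect the server's actual code or the change that eventually addressed this ticket.

```python
class NotPrimaryError(Exception):
    """Hypothetical error raised when the toy node is not primary."""


class ToyShard:
    def __init__(self):
        self.is_primary = True
        # Stands in for the single slot of the ActiveMigrationRegistry.
        self.active_migration = None

    def move_chunk(self, name):
        if self.active_migration is not None:
            raise RuntimeError("another migration is already registered")
        self.active_migration = name
        try:
            # Re-check after registering: a stepdown may have completed
            # between the command's arrival and this point.
            if not self.is_primary:
                raise NotPrimaryError(name)
            return "migrated " + name
        finally:
            # Always release the slot so a later recovery is not blocked.
            self.active_migration = None


shard = ToyShard()
shard.is_primary = False  # the stepdown completed before registration
try:
    shard.move_chunk("second")
    outcome = "ran on secondary"
except NotPrimaryError:
    outcome = "rejected"
print(outcome, shard.active_migration)
```

In this toy model the rejected migration leaves the slot empty, so a subsequent stepup recovery would not find the registry occupied.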


0001-SERVER-60521-repro.patch
5 kB
Oct 07 2021 02:46:27 PM UTC

is related to

SERVER-70127 Default system operations to be killable by stepdown

Closed

related to

SERVER-62245 MigrationRecovery must not assume that only one migration needs to be recovered

Closed

SERVER-60161 Deadlock between config server stepdown and _configsvrRenameCollectionMetadata command

Closed

SERVER-62296 MoveChunk should recover any unfinished migration before starting a new one

Closed

Assignee: Sergi Mateo Bellido
Reporter: Jordi Serra Torrens
Participants: Jordi Serra Torrens, Kaloian Manassiev, Sergi Mateo Bellido
Votes: 0
Watchers: 9

Created: Oct 07 2021 02:44:31 PM UTC
Updated: Oct 27 2023 08:45:54 PM UTC
Resolved: Feb 16 2022 11:53:24 AM UTC
Confidence Status Last Update: 09/Feb/22 9:38 AM
