[SERVER-60521] Deadlock on stepup due to moveChunk command running uninterrupted on secondary Created: 07/Oct/21  Updated: 27/Oct/23  Resolved: 16/Feb/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.4.0, 5.0.0, 5.1.0-rc0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jordi Serra Torrens Assignee: Sergi Mateo Bellido
Resolution: Gone away Votes: 0
Labels: sharding-wfbf-sprint, shardingemea-qw
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File 0001-SERVER-60521-repro.patch    
Issue Links:
Related
related to SERVER-62245 MigrationRecovery must not assume tha... Closed
related to SERVER-60161 Deadlock between config server stepdo... Closed
related to SERVER-62296 MoveChunk should recover any unfinish... Closed
is related to SERVER-70127 Default system operations to be killa... Closed
Operating System: ALL
Steps To Reproduce:

0001-SERVER-60521-repro.patch

Sprint: Sharding EMEA 2021-10-18, Sharding EMEA 2022-02-21
Participants:
Linked BF Score: 0

 Description   

Consider a shard that was running a moveChunk and had already persisted the migration recovery document. It then steps down, so the new primary will need to recover the migration.
In parallel, on that same node, another moveChunk arrived while it was still primary, but didn't yet execute past this. Now the stepdown completes and this second moveChunk continues and is able to register the migration (since the first migration already unregistered from the ActiveMigrationRegistry). A new ThreadClient will be created and marked as killable on stepdown. However, since the node already transitioned to secondary, it won't actually get killed.

Consider the following interleaving:
1. A shard that was running a moveChunk had already persisted the migration recovery document. It then steps down, so the new primary will need to recover the migration.
2. In parallel, in that same node, another moveChunk just arrived while it was still primary, but didn't yet execute past this.
3. The stepdown completes
4. The second moveChunk continues and is able to register the migration (since the first migration already unregistered from the ActiveMigrationRegistry). A new ThreadClient will be created and marked as killable on stepdown. However, since the node already transitioned to secondary, it won't actually get killed.
5. The old primary that just stepped down wins the election and becomes primary again.
6. During stepup, the primary will see that there was a migration ongoing (the one started in (1)), so it will attempt to recover it. To do so, it needs to acquire the MigrationBlockingGuard while still in drain mode. However, since the migration started in (2) managed to register in the ActiveMigrationRegistry, the MigrationBlockingGuard cannot be acquired and the stepup waits.
7. On the other side, migration (2) cannot make progress because the stepup holds the global lock, so it will never be able to release the ActiveMigrationRegistry.
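The two waiters form a classic lock-ordering cycle. As a minimal sketch (the lock objects below are illustrative stand-ins for the global lock and the ActiveMigrationRegistry, not MongoDB internals), steps 4, 6, and 7 above play out like this:

```python
import threading

global_lock = threading.Lock()  # stand-in for the global lock held during stepup
registry = threading.Lock()     # stand-in for the ActiveMigrationRegistry

# Step 4: the second moveChunk registers itself in the registry.
registry.acquire()

# Step 6: stepup holds the global lock and then needs the registry
# (via the MigrationBlockingGuard) -- it cannot get it.
global_lock.acquire()
stepup_blocked = not registry.acquire(timeout=0.1)

# Step 7: the second moveChunk needs the global lock to make progress
# and release the registry -- it cannot get it either.
migration_blocked = not global_lock.acquire(timeout=0.1)

# Both waiters are stuck on a lock the other holds: a deadlock.
print(stepup_blocked and migration_blocked)
```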

To fix this we should make sure that moveChunk cannot run uninterrupted on a secondary.



 Comments   
Comment by Sergi Mateo Bellido [ 16/Feb/22 ]

We recently made several fixes to moveChunk that remove this deadlock.

Together with jordi.serra-torrens, I analyzed what would happen in that scenario and everything seemed OK.

Comment by Sergi Mateo Bellido [ 15/Feb/22 ]

The deadlock described in this ticket cannot happen anymore since we don't acquire the MigrationBlockingGuard as part of resumeMigrationCoordinationsOnStepUp (SERVER-62245).

Before analyzing what would happen on master, I would like to mention two relevant tasks that we implemented recently:

  1. SERVER-62296: MoveChunk should recover any unfinished migration before starting a new one.
  2. SERVER-62704: Marking the moveChunk operation killable on step-down/step-up.

About the original problem (no changes from 1 to 5):
1. A shard that was running a moveChunk had already persisted the migration recovery document. It then steps down, so the new primary will need to recover the migration.
2. In parallel, in that same node, another moveChunk just arrived while it was still primary, but didn't yet execute past this.
3. The stepdown completes.
4. The second moveChunk continues and is able to register the migration (since the first migration already unregistered from the ActiveMigrationRegistry). A new ThreadClient will be created and marked as killable on stepdown. However, since the node already transitioned to secondary, it won't actually get killed.
5. The old primary that just stepped down wins the election and becomes primary again.
---- NEW STUFF ----
6. If the second moveChunk had already acquired the global lock in IX mode, the whole operation would be killed as part of the step-up. Otherwise, the second moveChunk would block until the step-up is completed and the global lock is released.
7. During stepup, the primary will see that there was a migration ongoing (the one started in (1)), so it will attempt to recover it. It is not a problem that the second moveChunk might still be alive holding the ActiveMigrationRegistry, since resumeMigrationCoordinationsOnStepUp no longer acquires the MigrationBlockingGuard.
8. Once the stepup is completed, if the second moveChunk wasn't killed, it will acquire the global lock in IX mode and be executed as if it had just arrived at the shard.
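The new interleaving can be sketched the same way (again with illustrative stand-in locks, not MongoDB internals): because stepup recovery no longer takes the MigrationBlockingGuard, the cycle from the original report is broken and both sides eventually proceed.

```python
import threading

global_lock = threading.Lock()  # stand-in for the global lock held during stepup
registry = threading.Lock()     # stand-in for the ActiveMigrationRegistry

# The surviving second moveChunk still holds the registry.
registry.acquire()

# Stepup takes the global lock, but recovery no longer needs the registry
# (resumeMigrationCoordinationsOnStepUp skips the MigrationBlockingGuard),
# so it completes and releases the global lock.
global_lock.acquire()
recovery_completed = True
global_lock.release()

# The second moveChunk can now take the global lock in IX mode and run
# as if it had just arrived at the shard.
migration_proceeds = global_lock.acquire(timeout=0.1)
print(recovery_completed and migration_proceeds)
```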

Comment by Kaloian Manassiev [ 18/Nov/21 ]

This is still a problem, but very unlikely and is not causing noise in our testing. Fix requires an iteration so putting it under the WFBF sprint category.

Generated at Thu Feb 08 05:50:00 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.