  1. Core Server
  2. SERVER-60521

Deadlock on stepup due to moveChunk command running uninterrupted on secondary

    • ALL
    • 0001-SERVER-60521-repro.patch
    • Sharding EMEA 2021-10-18, Sharding EMEA 2022-02-21

      Consider the following interleaving:
      1. A shard is running a moveChunk and has already persisted the migration recovery document. It then steps down, so the new primary will need to recover the migration.
      2. In parallel, on that same node, another moveChunk arrived while the node was still primary, but has not yet executed past this.
      3. The stepdown completes.
      4. The second moveChunk continues and is able to register the migration (since the first migration has already unregistered from the ActiveMigrationRegistry). A new ThreadClient is created and marked as killable on stepdown. However, since the node has already transitioned to secondary, it will not actually get killed.
      5. The old primary that just stepped down wins the election and becomes primary again.
      6. During stepup, the primary sees that there was a migration ongoing (the one started in (1)), so it attempts to recover it. To do so, it needs to acquire the MigrationBlockingGuard while still in drain mode. However, since the migration started in (2) managed to register on the ActiveMigrationRegistry, the MigrationBlockingGuard cannot be acquired and waits.
      7. On the other side, the migration from (2) cannot make progress because the stepup holds a global lock, so it will never be able to release the ActiveMigrationRegistry.
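
      The circular wait in steps 6 and 7 can be made explicit with a small wait-for model. This is only an illustrative sketch, not server code: the function find_wait_cycle and the actor and resource labels are hypothetical names chosen for this example.

```python
from typing import Dict, List, Optional


def find_wait_cycle(holds: Dict[str, str],
                    waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return the actors that form a wait-for cycle, or None if there is none.

    holds maps a resource to the actor currently holding it.
    waits_for maps an actor to the resource it is blocked on.
    """
    for start in waits_for:
        path = [start]
        actor = start
        while actor in waits_for:
            holder = holds.get(waits_for[actor])
            if holder is None:
                break  # the resource is free, so this actor can proceed
            if holder == start:
                return path  # followed the chain back to where we began
            if holder in path:
                break  # a cycle that excludes start; another start finds it
            path.append(holder)
            actor = holder
    return None


# State after step 7 of the interleaving (labels are illustrative).
holds = {
    "ActiveMigrationRegistry": "second moveChunk",  # registered in step 4
    "global lock": "stepup",                        # taken during stepup
}
waits_for = {
    "stepup": "ActiveMigrationRegistry",   # MigrationBlockingGuard waits
    "second moveChunk": "global lock",     # cannot make progress
}

cycle = find_wait_cycle(holds, waits_for)
print(cycle)
```

      In this model the cycle disappears as soon as the second moveChunk no longer holds the registry, which is why preventing it from registering on a secondary is enough to break the deadlock.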

      To fix this, we should make sure that moveChunk cannot run uninterrupted on a secondary.
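
      One possible shape for such a guard, sketched on a toy model rather than on the server code: mark the operation as killable and check the primary state in the same critical section that a stepdown uses, so the operation is either seen and killed by the stepdown or observes that the stepdown already happened. The Node class, NotPrimaryError, and all field names below are assumptions made for illustration; this is not the change committed for this ticket.

```python
import threading
from typing import Dict, List, Optional


class NotPrimaryError(Exception):
    """Raised when a primary-only operation finds itself on a secondary."""


class Node:
    """Toy model of a shard node (hypothetical, not MongoDB code)."""

    def __init__(self) -> None:
        self._state_lock = threading.Lock()
        self.is_primary = True
        # Operations that a stepdown must interrupt.
        self.killable_ops: List[Dict[str, object]] = []
        # Stand-in for the single slot of the ActiveMigrationRegistry.
        self.active_migration: Optional[str] = None

    def step_down(self) -> None:
        with self._state_lock:
            self.is_primary = False
            for op in self.killable_ops:
                op["killed"] = True

    def start_move_chunk(self, name: str) -> Dict[str, object]:
        op: Dict[str, object] = {"name": name, "killed": False}
        with self._state_lock:
            # Become killable first, then verify that we are still primary.
            self.killable_ops.append(op)
            if not self.is_primary:
                self.killable_ops.remove(op)
                raise NotPrimaryError(f"{name}: node is no longer primary")
            if self.active_migration is not None:
                self.killable_ops.remove(op)
                raise RuntimeError(f"{name}: another migration is active")
            # In this model, registration shares the critical section, so it
            # can only succeed while the node is primary.
            self.active_migration = name
        return op


# The second moveChunk from the interleaving resumes after the stepdown.
node = Node()
node.step_down()
try:
    node.start_move_chunk("second moveChunk")
    rejected = False
except NotPrimaryError:
    rejected = True
print(rejected, node.active_migration)
```

      With this check the late moveChunk never occupies the registry on a secondary, so a later stepup can acquire what it needs to recover the earlier migration.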

            Assignee:
            sergi.mateo-bellido@mongodb.com Sergi Mateo Bellido
            Reporter:
            jordi.serra-torrens@mongodb.com Jordi Serra Torrens
            Votes:
            0
            Watchers:
            9