Core Server / SERVER-65969

Migration completion must not be signaled before releasing the ActiveMigrationRegistry

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 6.0.0-rc8, 6.1.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Backport Requested: v6.0
    • Sprint: Sharding EMEA 2022-05-16, Sharding EMEA 2022-05-30

      This race condition is probably very rare, but it is worth reporting because diagnosing it required a lot of investigation into a failing test in a patch. It can occur if the CSRS steps down during any test that issues two consecutive moveChunk commands on different ranges (e.g. here).

      When a _shardsvrMoveRange command (moveChunk in previous versions) joins an ongoing migration, it waits for the original migration to complete. However, that completion is currently signaled before the ActiveMigrationRegistry is released.

      As a result, the following sequence can occur:

      1. Router sends moveChunk to CSRS node A
      2. CSRS node A sends _shardsvrMoveRange to shard
      3. CSRS node A steps down and CSRS node B steps up
      4. Router receives an error from CSRS node A, retries the moveChunk
      5. CSRS node B sends _shardsvrMoveRange to shard, joining ongoing migration
      6. The ongoing migration succeeds, signals completion before releasing the ActiveMigrationRegistry
      7. [very fast] Router receives success from CSRS node B, sends a new moveChunk for a different range
      8. [very fast] CSRS B forwards the new operation to the shard
      9. Shard replies with error because the ActiveMigrationRegistry has not been released yet (so the test fails)
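
      The ordering problem in steps 6 to 9 can be illustrated with a small, deterministic Python model. This is only a sketch and not the server code: the class names mirror the concepts in this ticket, but Migration, join, finish, release_first and run_scenario are hypothetical names introduced for illustration, and the "very fast" router is modeled as a callback that runs at the instant completion is signaled.

```python
class ActiveMigrationRegistry:
    """Toy stand-in: allows at most one active migration on the shard."""

    def __init__(self):
        self.active_range = None

    def register(self, chunk_range):
        if self.active_range is not None:
            raise RuntimeError(
                f"migration of {self.active_range} is still registered")
        self.active_range = chunk_range

    def release(self):
        self.active_range = None


class Migration:
    """Hypothetical migration that joiners can wait on."""

    def __init__(self, registry, chunk_range):
        self.registry = registry
        self.joiners = []  # callbacks invoked when completion is signaled
        registry.register(chunk_range)

    def join(self, on_complete):
        self.joiners.append(on_complete)

    def _signal_completion(self):
        for callback in self.joiners:
            callback()

    def finish(self, release_first):
        if release_first:
            # Proposed order: release the registry, then wake up joiners.
            self.registry.release()
            self._signal_completion()
        else:
            # Order described in this ticket: joiners wake up while the
            # registry still holds the finished migration.
            self._signal_completion()
            self.registry.release()


def run_scenario(release_first):
    registry = ActiveMigrationRegistry()
    outcome = []

    first = Migration(registry, "range A")  # steps 1-2

    def router_sends_next_move_chunk():
        # Steps 7-9: the retried request returns and the router
        # immediately starts a migration on a different range.
        try:
            Migration(registry, "range B")
            outcome.append("ok")
        except RuntimeError as exc:
            outcome.append(f"error: {exc}")

    first.join(router_sends_next_move_chunk)  # step 5: retry joins
    first.finish(release_first)               # step 6
    return outcome[0]


if __name__ == "__main__":
    print("signal, then release:", run_scenario(release_first=False))
    print("release, then signal:", run_scenario(release_first=True))
```

      In this model the second migration is rejected only when completion is signaled first, which matches the title of this ticket: the registry should be released before completion is signaled.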

            Assignee:
            paolo.polato@mongodb.com Paolo Polato
            Reporter:
            pierlauro.sciarelli@mongodb.com Pierlauro Sciarelli