Race in migrate protocol can cause moveChunk cmd to hang


    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 2.5.1
    • Affects Version/s: 2.4.1
    • Component/s: Sharding
    • ALL

      Note: this can happen only if more than one migration is in progress in a cluster (for example, when running moveChunk manually).

      Setup:
      3 shards, 2 sharded collections

      Description of race:
      1. Move one chunk from shard1 to shard0.
      2. The migrate thread performing recvChunk on shard0 fails for some reason and terminates early, setting the incoming migration's active state to false.
      3. Move one chunk (ideally empty, so that it is fast) from shard2 to shard0. This in effect starts a new migration and changes the state to 'done'.
      4. shard1 calls _recvChunkStatus, completely misses the transition to the 'fail' state, and instead sees the 'done' state left by the migration in step 3. It then keeps looping until some other, slower migration begins and changes the state to 'steady'.
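
      The four steps above can be replayed with a small, deterministic model. This is only a sketch under assumptions, not the server code: `RecipientShard`, `donor_poll` and `run_race` are hypothetical names, the state strings other than 'fail', 'done' and 'steady' are placeholders, and the donor loop is assumed to abort on 'fail' and proceed on 'steady'. The point it illustrates is that the recipient keeps one shared status that does not record which donor started the migration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RecipientShard:
    """Hypothetical stand-in for shard0: a single shared incoming-migration status."""
    active: bool = False
    state: str = "ready"            # placeholder initial state (assumption)
    source: Optional[str] = None    # tracked here, but never exposed to pollers

    def start(self, source: str) -> None:
        # A new incoming migration overwrites whatever the previous one left behind.
        self.active, self.state, self.source = True, "clone", source

    def fail(self) -> None:
        self.state, self.active = "fail", False

    def finish(self) -> None:
        self.state, self.active = "done", False

    def recv_chunk_status(self) -> dict:
        # Models the idea of _recvChunkStatus: the reply does not say
        # which donor's migration the status belongs to.
        return {"active": self.active, "state": self.state}


def donor_poll(recipient: RecipientShard, max_polls: int) -> Tuple[str, List[str]]:
    """Poll as the donor (shard1) would: proceed on 'steady', abort on 'fail'."""
    seen: List[str] = []
    for _ in range(max_polls):
        state = recipient.recv_chunk_status()["state"]
        seen.append(state)
        if state == "fail":
            return "aborted", seen
        if state == "steady":
            return "proceed", seen
    return "still waiting", seen   # a real donor would keep looping here


def run_race(interleave: bool, max_polls: int = 5) -> Tuple[str, List[str]]:
    shard0 = RecipientShard()
    shard0.start("shard1")         # step 1: shard1 -> shard0 begins
    shard0.fail()                  # step 2: migrate thread terminates early
    if interleave:
        shard0.start("shard2")     # step 3: fast migration shard2 -> shard0
        shard0.finish()            #         overwrites 'fail' with 'done'
    return donor_poll(shard0, max_polls)   # step 4: shard1 polls for status


if __name__ == "__main__":
    print(run_race(interleave=False))
    print(run_race(interleave=True))
```

      In this model, the run without the interleaved migration ends with the donor observing 'fail' and aborting, while the interleaved run only ever returns 'done' to the donor, so it exhausts its polls without reaching either exit condition.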

      Attaching patch that demonstrates this race.

        1. patch
          3 kB
          Randolph Tan

            Assignee:
            Randolph Tan
            Reporter:
            Randolph Tan
            Votes:
            0
            Watchers:
            4

              Created:
              Updated:
              Resolved: