Core Server / SERVER-9328

Race in migrate protocol can cause moveChunk cmd to hang


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • 2.5.1
    • 2.4.1
    • Sharding
    • None
    • ALL

    Description

      Note: this can happen only if more than one migration is in progress in the cluster at the same time (for example, when running moveChunk manually).

      Setup:
      3 shards, 2 sharded collections

      Description of race:
      1. Move 1 chunk from shard1 to shard0.
      2. The migrate thread performing recvChunk on shard0 fails for some reason and terminates early, setting the incoming migration's active state to false.
      3. Move 1 chunk (ideally an empty one, so the migration is fast) from shard2 to shard0. In effect, this starts a new migration on shard0 and changes the state to 'done'.
      4. shard1 calls _recvChunkStatus, completely misses the transition to the 'fail' state, and instead sees the 'done' state left by the migration from step 3. It then keeps looping until some other slow migration begins and changes the state to 'steady'.

      Attaching patch that demonstrates this race.
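
      The attached patch is not reproduced here. As a separate illustration, the following is a minimal Python sketch of the interleaving described above. It is a simplified model, not the server code: it assumes the recipient shard keeps a single shared status slot for incoming migrations and that the donor's polling loop stops only on 'steady' or 'fail'. RecipientStatus, donor_wait_for_steady, and max_polls are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass


@dataclass
class RecipientStatus:
    """Hypothetical model: ONE status slot shared by all incoming migrations."""
    active: bool = False
    state: str = "ready"
    source: str = ""  # donor shard of the most recent migration

    def start(self, source):
        # A new incoming migration overwrites whatever state was there before.
        self.active = True
        self.state = "clone"
        self.source = source

    def fail(self):
        self.state = "fail"
        self.active = False

    def finish(self):
        self.state = "done"
        self.active = False

    def recv_chunk_status(self):
        # Stand-in for _recvChunkStatus: reports the slot, whoever owns it.
        return {"active": self.active, "state": self.state, "from": self.source}


def donor_wait_for_steady(status, max_polls):
    """Assumed donor loop: stop on 'steady' or 'fail', otherwise keep polling.

    max_polls only exists so the sketch terminates; the real loop has no
    such bound, which is why the command appears to hang.
    """
    seen = []
    for _ in range(max_polls):
        reply = status.recv_chunk_status()
        seen.append(reply["state"])
        if reply["state"] == "steady":
            return "proceed", seen
        if reply["state"] == "fail":
            return "abort", seen
    return "hang", seen


shard0 = RecipientStatus()
shard0.start("shard1")   # step 1: shard1 -> shard0 begins
shard0.fail()            # step 2: recipient thread fails and goes inactive
shard0.start("shard2")   # step 3: fast shard2 -> shard0 migration...
shard0.finish()          # ...completes before shard1 polls again
outcome, seen = donor_wait_for_steady(shard0, max_polls=5)  # step 4
print(outcome, seen)     # hang ['done', 'done', 'done', 'done', 'done']
```

      In this model, shard1 would have aborted cleanly had it polled between steps 2 and 3. Because the slot carries no record of which migration a state belongs to, the 'fail' is overwritten and shard1 only ever observes 'done'.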

          People

            randolph@mongodb.com Randolph Tan
            Votes: 0
            Watchers: 4
