|
Note: this can happen only if more than one migration takes place in the cluster (for example, when running moveChunk manually).
Setup:
3 shards, 2 sharded collections
Description of race:
1. Move 1 chunk from shard1 to shard0.
2. The migrate thread performing recvChunk on shard0 fails for some reason and terminates early, setting the incoming migration's active state to false.
3. Move 1 chunk (ideally empty, so the move is fast) from shard2 to shard0. This in effect starts a new migration and changes the state to 'done'.
4. shard1 calls _recvChunkStatus, completely misses the transition to the 'fail' state, and instead sees the 'done' state from the migration in step 3. It then keeps looping until some other slow migration begins and changes the state to 'steady'.
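The interleaving above can be sketched as a toy model. This is not MongoDB code and is separate from the attached patch: the Recipient class, donor_wait, run_race, and the polling limit are hypothetical names invented for illustration. The sketch assumes that shard0 keeps a single shared status slot for incoming migrations, and that the donor only leaves its wait loop on 'steady' or 'fail'.

```python
class Recipient:
    """Hypothetical stand-in for shard0's single incoming-migration status slot."""

    def __init__(self):
        self.active = False
        self.state = "ready"

    def start(self):
        # A new incoming migration overwrites whatever state was left behind.
        self.active = True
        self.state = "ready"

    def fail(self):
        # Step 2: the migrate thread terminates early.
        self.state = "fail"
        self.active = False

    def finish(self):
        # Step 3: the second (fast) migration runs to completion.
        self.state = "done"
        self.active = False

    def recv_chunk_status(self):
        # Assumption: the reply does not say which migration the state belongs to.
        return self.state


def donor_wait(recipient, max_polls):
    """Assumed donor loop: proceed on 'steady', abort on 'fail', otherwise keep polling."""
    for _ in range(max_polls):
        state = recipient.recv_chunk_status()
        if state == "steady":
            return "proceed"
        if state == "fail":
            return "abort"
    return "still waiting"


def run_race(poll_before_second_migration):
    shard0 = Recipient()
    shard0.start()   # step 1: shard1 -> shard0
    shard0.fail()    # step 2: recipient fails early
    if poll_before_second_migration:
        return donor_wait(shard0, max_polls=5)
    shard0.start()   # step 3: shard2 -> shard0
    shard0.finish()
    return donor_wait(shard0, max_polls=5)  # step 4: shard1 polls too late


print(run_race(poll_before_second_migration=True))
print(run_race(poll_before_second_migration=False))
```

In this model, a donor that polls before the second migration starts observes 'fail' and aborts, while a donor that polls afterwards only ever observes 'done' and exhausts its polls, which mirrors the looping described in step 4.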
Attaching a patch that demonstrates this race.
|