[SERVER-9328] Race in migrate protocol can cause moveChunk cmd to hang Created: 11/Apr/13  Updated: 11/Jul/16  Resolved: 08/Jul/13

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.4.1
Fix Version/s: 2.5.1

Type: Bug Priority: Major - P3
Reporter: Randolph Tan Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File patch    
Issue Links:
Depends
Operating System: ALL
Participants:

 Description   

Note: this can happen only if more than one migration is happening in the cluster (for example, when running moveChunk manually).

Setup:
3 shards, 2 sharded collections

Description of race:
1. Move one chunk from shard1 to shard0.
2. The migrate thread performing recvChunk on shard0 fails for some reason and terminates early, setting the incoming migration's active state to false.
3. Move one chunk (ideally empty, so it will be fast) from shard2 to shard0. This, in effect, starts a new migration and changes the state to 'done'.
4. shard1 calls _recvChunkStatus, completely misses the transition to the 'fail' state, and instead sees the 'done' state from the migration in step 3. It then keeps looping until some other slow migration begins and changes the state to 'steady'.
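
The four steps above can be replayed with a small, deterministic model. This is only an illustrative sketch in Python, not the server code: the class RecipientShard, the functions donor_poll and replay_race, the 'ready' and 'clone' state names, and the donor field are all hypothetical names introduced here. The check_donor flag shows one possible way a donor could notice that the status belongs to a different migration; it is an assumption for illustration and not necessarily what the attached patch or the eventual fix does.

```python
class RecipientShard:
    """Hypothetical model of shard0: one status slot shared by all incoming migrations."""

    def __init__(self):
        self.active = False
        self.state = "ready"   # assumed initial state name
        self.donor = None      # assumed field: which shard started the latest migration

    def start(self, donor):
        # A new incoming migration overwrites whatever the previous one left behind.
        self.active = True
        self.state = "clone"   # assumed intermediate state name
        self.donor = donor

    def fail(self):
        self.state = "fail"
        self.active = False

    def finish(self):
        self.state = "done"
        self.active = False

    def recv_chunk_status(self):
        # Stand-in for the status the donor polls with _recvChunkStatus.
        return {"active": self.active, "state": self.state, "donor": self.donor}


def donor_poll(recipient, donor, check_donor):
    """One iteration of the donor's wait loop."""
    status = recipient.recv_chunk_status()
    if check_donor and status["donor"] != donor:
        return "abort"         # status describes someone else's migration
    if status["state"] == "fail":
        return "abort"
    if status["state"] == "steady":
        return "commit"
    return "keep waiting"      # includes 'done', which is where shard1 gets stuck


def replay_race(check_donor):
    shard0 = RecipientShard()
    shard0.start("shard1")     # step 1: shard1 -> shard0 begins
    shard0.fail()              # step 2: migrate thread fails and exits early
    shard0.start("shard2")     # step 3: shard2 -> shard0 begins ...
    shard0.finish()            # ... and completes quickly (empty chunk)
    return donor_poll(shard0, "shard1", check_donor)  # step 4: shard1 polls late
```

In this model, replay_race(False) returns "keep waiting" because shard1 never observes the 'fail' state, while replay_race(True) returns "abort" because the status is recognized as belonging to the shard2 migration.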

Attaching a patch that demonstrates this race.



 Comments   
Comment by auto [ 08/Jul/13 ]

Author: Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-9328 Race in migrate protocol can cause moveChunk cmd to hang
Branch: master
https://github.com/mongodb/mongo/commit/9a5e0a4f1659a1c8f30ffdfd2311babb318c62b7
