Note: this is different than
WT nodes, running 3.2.10. During chunk migration - we notice severe drops in mongos query ability (from 500 req/sec to 0) - rendering the cluster unusable for several seconds . This happens when the chunk is being read from the source node, and exactly every 16mb of reads (we see a surge of disk reads at this time). Using a tester that tries to query both mongos and the source mongod at times in which the freeze happens, we noticed that only mongos is frozen - while the source mongod is responsive . To eliminate any hardware bottleneck - we used very very powerful machines, in mongos, config servers and mongods - in terms of CPU and disk IOPS.
This is how a drop looks like (every line represents a second):
Only the balanced collection is affected during the freeze - and we see nothing special in the logs of either mongos and the source mongod when this happens. Secondary throttling does not help.
This renders our cluster unusable during balancing - so currently our balancer is stopped. Any advice on the source of this issue and how we can address it will be extremely helpful. Thanks