Note: this is different from SERVER-22611.
WiredTiger nodes, running 3.2.10. During chunk migration we see severe drops in mongos query throughput (from 500 req/sec to 0), which render the cluster unusable for several seconds at a time. This happens while the chunk is being read from the source node, and it recurs after every 16 MB of reads (we see a surge of disk reads at those moments). Using a tester that queries both mongos and the source mongod while a freeze is in progress, we found that only mongos is frozen; the source mongod stays responsive. To rule out a hardware bottleneck, we used very powerful machines, in terms of CPU and disk IOPS, for mongos, the config servers and the mongods.
This is what a drop looks like (every line represents one second):
444 Requests, 0 Failed (0 Timeouts)
574 Requests, 0 Failed (0 Timeouts)
516 Requests, 0 Failed (0 Timeouts)
512 Requests, 0 Failed (0 Timeouts)
463 Requests, 0 Failed (0 Timeouts)
212 Requests, 0 Failed (0 Timeouts)
28 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
203 Requests, 0 Failed (0 Timeouts)
66 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
221 Requests, 0 Failed (0 Timeouts)
69 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
320 Requests, 0 Failed (0 Timeouts)
7 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
16 Requests, 0 Failed (0 Timeouts)
98 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
313 Requests, 0 Failed (0 Timeouts)
783 Requests, 0 Failed (0 Timeouts)
355 Requests, 0 Failed (0 Timeouts)
682 Requests, 0 Failed (0 Timeouts)
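For anyone trying to quantify the stalls from tester output in this format, here is a minimal Python sketch that extracts the per-second request counts and reports each run of stalled seconds. The function names (`requests_per_second`, `stall_windows`) and the `threshold` parameter are hypothetical helpers written for illustration; they are not part of our tester or of any MongoDB tooling.

```python
import re
from typing import List, Tuple

# Matches one tester entry, e.g. "444 Requests, 0 Failed (0 Timeouts)".
ENTRY_RE = re.compile(r"(\d+) Requests, (\d+) Failed \((\d+) Timeouts\)")


def requests_per_second(log_text: str) -> List[int]:
    """Return the request count of every entry, in order of appearance."""
    return [int(m.group(1)) for m in ENTRY_RE.finditer(log_text)]


def stall_windows(counts: List[int], threshold: int = 0) -> List[Tuple[int, int]]:
    """Return (start_index, length) for each run of consecutive seconds
    whose request count is <= threshold (0 means fully frozen)."""
    windows: List[Tuple[int, int]] = []
    start = None
    for i, count in enumerate(counts):
        if count <= threshold:
            if start is None:
                start = i  # a new stall begins here
        elif start is not None:
            windows.append((start, i - start))
            start = None
    if start is not None:  # stall still in progress at end of input
        windows.append((start, len(counts) - start))
    return windows


if __name__ == "__main__":
    sample = (
        "512 Requests, 0 Failed (0 Timeouts)\n"
        "0 Requests, 0 Failed (0 Timeouts)\n"
        "0 Requests, 0 Failed (0 Timeouts)\n"
        "221 Requests, 0 Failed (0 Timeouts)\n"
    )
    for start, length in stall_windows(requests_per_second(sample)):
        print(f"stall starting at second {start}, lasting {length}s")
```

Applied to the output above, the longest run of consecutive zero-request seconds is 10 seconds.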
Only the collection being balanced is affected during the freeze, and we see nothing unusual in the logs of either mongos or the source mongod when this happens. Secondary throttling does not help.
This makes our cluster unusable during balancing, so our balancer is currently stopped. Any advice on the source of this issue and how we can address it would be extremely helpful. Thanks.