The new mongos query path schedules commands to run on the shards using the ThreadPoolTaskExecutor attached to Grid. Each mongos process has a single ThreadPoolTaskExecutor instance, which synchronizes access to its internal queues through a single mutex. Performance testing has shown that, when enough queries execute concurrently on mongos, the threads spend much of their time blocked waiting to acquire this lock. Hacking mongos to create a ThreadPoolTaskExecutor per connection thread shows a 3x increase in throughput for a find-by-_id benchRun() workload.
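To make the contention pattern concrete, here is a toy sketch of the two arrangements. It is not the real ThreadPoolTaskExecutor: `ToyExecutor` and `runWorkload` are hypothetical names made up for illustration, and the only property carried over from the description above is that every scheduling call takes one mutex per executor instance. With `shared == true` all threads go through one instance (and therefore one mutex); with `shared == false` each thread gets its own instance, which is roughly what the per-connection-thread hack does.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for an executor's work queue. Every schedule() call
// takes the same per-instance mutex; this is an assumption made for the
// sketch, not a copy of the real class.
class ToyExecutor {
public:
    void schedule(std::function<void()> task) {
        std::lock_guard<std::mutex> lk(_mutex);
        _queue.push_back(std::move(task));
    }

    // Runs and removes all queued tasks; returns how many were run.
    std::size_t drain() {
        std::deque<std::function<void()>> local;
        {
            std::lock_guard<std::mutex> lk(_mutex);
            local.swap(_queue);
        }
        for (auto& task : local) {
            task();
        }
        return local.size();
    }

private:
    std::mutex _mutex;
    std::deque<std::function<void()>> _queue;
};

// Starts numThreads threads that each schedule perThread no-op tasks.
// shared == true:  all threads use executors[0] and contend on one mutex.
// shared == false: thread i uses executors[i], so no mutex is shared.
// Returns the total number of tasks that were scheduled and then drained.
std::size_t runWorkload(std::size_t numThreads, std::size_t perThread, bool shared) {
    std::vector<ToyExecutor> executors(shared ? 1 : numThreads);

    std::vector<std::thread> threads;
    threads.reserve(numThreads);
    for (std::size_t i = 0; i < numThreads; ++i) {
        ToyExecutor& exec = executors[shared ? 0 : i];
        threads.emplace_back([&exec, perThread] {
            for (std::size_t n = 0; n < perThread; ++n) {
                exec.schedule([] {});
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }

    std::size_t total = 0;
    for (auto& e : executors) {
        total += e.drain();
    }
    return total;
}
```

Both variants do the same amount of work, e.g. `runWorkload(4, 1000, true)` and `runWorkload(4, 1000, false)` each schedule 4000 tasks. Wrapping each call in a `std::chrono::steady_clock` timer is one way to see how much of the shared variant's time goes to waiting on the single mutex; the toy says nothing about the actual numbers on mongos.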
For reference, I'm using the following script to generate load on mongos: