As described in the MongoDB Manual (http://docs.mongodb.org/manual/core/aggregation-pipeline-sharded-collections/), from 2.6 onwards aggregation pipelines are executed as follows in a sharded cluster:
When operating on a sharded collection, the aggregation pipeline is split into two parts. The first pipeline runs on each shard, or if an early $match can exclude shards through the use of the shard key in the predicate, the pipeline runs on only the relevant shards.
The second pipeline consists of the remaining pipeline stages and runs on the primary shard. The primary shard merges the cursors from the other shards and runs the second pipeline on these results. The primary shard forwards the final results to the mongos.
For long-running aggregation queries that aggregate a lot of data, the second part of the pipeline (running on the primary shard) is a bottleneck to performance where access to data is sufficiently fast that the query becomes bound by CPU, not I/O. The second part of the pipeline runs in a single thread on the primary shard.
To improve performance for such long-running CPU-bound aggregations, it would be good to add multiple levels of merging such that the 'merger' role can be distributed to multiple shards. For example, in a sharded cluster with 16 shards, have 4 'first level' mergers each of which are responsible for merging the results from 4 shards, then a 'second level' merger which merges the results from the first level mergers and returns the result to mongos.