Projection pushdown makes performance worse for very small documents (< 1000 bytes). According to profiles we've taken, the root cause of this is the batching done by DocumentSourceCursor.
When a $project is before the batching (pushed down to the PlanStage layer), 16MB of results are collected before being handed off to the aggregation layer. For a workload with small documents, this means that a lot of allocations are done (for things like DocumentStorages, DocumentStorages' caches and so on) before any of them are freed. The amount of memory allocated grows until the batch is full and only then starts to shrink again, as the batch is drained.
On the other hand, when the $project is after the batching (in the agg layer), the stage will perform some allocations, but the memory will be deallocated immediately after, when the document is serialized to BSON, or when a stage like $group is reached.
Although the total number of allocations is the same between the two plans, the amount of memory allocated at any given time is much smaller for the plan where $project is in the aggregation layer than it is for the plan where $project is pushed down. The case where the $project is pushed down causes the memory allocator to perform poorly. When using tcmalloc, for example, we see that the plan where $project is pushed down causes the thread-local cache to be exhausted, forcing the allocator to fetch from the "central free list" (see here for a brief description of tcmalloc). The plan with $project in the aggregation layer does not exhibit this behavior.
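To make the "same number of allocations, very different peak" point concrete, here is a toy model rather than anything taken from the server code. It assumes one allocation per projected document and a batch limit expressed as a document count (the real limit is 16MB of results, not a count). The function name and parameters are made up for illustration.

```python
def simulate_live_allocations(num_docs, batch_size):
    """Toy model: return (total_allocations, peak_live_allocations).

    Assumption: each projected document costs exactly one allocation,
    and that allocation is freed once the document leaves the batch.
    batch_size=1 stands in for the plan with $project in the agg layer
    (allocate, then free right away); a large batch_size stands in for
    the pushed-down plan, where results pile up before being consumed.
    """
    live = 0       # allocations currently outstanding
    peak = 0       # highest value of `live` seen so far
    total = 0      # allocations performed overall
    buffered = 0   # documents sitting in the current batch

    for _ in range(num_docs):
        live += 1
        total += 1
        buffered += 1
        peak = max(peak, live)
        if buffered == batch_size:
            # Batch is full: downstream consumes it and the memory is freed.
            live -= buffered
            buffered = 0

    # Flush the final, partially filled batch.
    live -= buffered
    return total, peak


if __name__ == "__main__":
    print(simulate_live_allocations(100_000, 1))       # (100000, 1)
    print(simulate_live_allocations(100_000, 10_000))  # (100000, 10000)
```

Both runs perform the same number of allocations, but the batched run keeps four orders of magnitude more of them alive at once. That kind of pattern is what would be expected to overflow an allocator's thread-local cache.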
This can be reproduced using the pipeline:
When populating the collection with extremely small documents and disabling projection pushdown, the query performs significantly (25-30%) better. The data we used was the data generated as part of the map_reduce sys performance workload.
We spent a decent amount of time confirming that this is caused by batching, and not by other differences between the find and agg layers, such as lock yielding. In one experiment we were able to reproduce the regression by introducing a $_internalBatch stage that simply spools K documents and then unspools them.
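For readers unfamiliar with the spool/unspool idea, the behavior of such a stage can be sketched as a generator. This is only an illustration of the technique, not the experimental stage itself: `internal_batch`, `source` and `k` are hypothetical names, and `k` is assumed to be at least 1.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, TypeVar

T = TypeVar("T")


def internal_batch(source: Iterable[T], k: int) -> Iterator[T]:
    """Spool up to k items from `source`, then hand them out one by one.

    Hypothetical stand-in for a batching stage: nothing is emitted until
    k items have been buffered (or the input is exhausted), so up to k
    items are held alive at the same time.
    """
    spool: Deque[T] = deque()
    for item in source:
        spool.append(item)
        if len(spool) == k:
            # Unspool: release the buffered items in arrival order.
            while spool:
                yield spool.popleft()
    # Input exhausted: drain whatever is left in a partial batch.
    while spool:
        yield spool.popleft()
```

The output order is identical to the input order. The only observable difference is when items are pulled from the source and how many are held at once, which is what makes a stage like this useful for isolating the effect of batching from other differences between the layers.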
It is important to note that in the general case projection pushdown is beneficial. Since the DocumentSource layer does not hold locks, documents must not point to any resources owned by the storage layer and must be copied (or "owned" via getOwned()). Inclusion projections always produce "owned" Documents, so pushdown into the PlanStage layer means that less data has to be copied across the find/agg membrane. The time saved is usually proportional to the size of the document.
For this case with tiny documents, the time saved is negligible, as the overhead caused by applying the projection dominates.
Based on rough numbers gathered locally using this made-up map reduce workload, plans with and without pushdown seem to "tie" for documents of size 1KB. For documents of size 10KB, the plan with pushdown does ~30% better. For documents of size 1MB, the plan with pushdown does > 2x better.
The ultimate problem here is the batching that DocumentSourceCursor does. If we can get rid of that (certainly no small feat), this sort of issue should go away.