[SERVER-24981] $project-$limit optimization has bad repercussion on pipeline splitting Created: 11/Jul/16 Updated: 17/Apr/18 Resolved: 07/Dec/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | None |
| Fix Version/s: | 3.7.1 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Antoine Hom | Assignee: | Janna Golden |
| Resolution: | Done | Votes: | 0 |
| Labels: | performance |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v3.6 |
| Sprint: | Query 2017-11-13, Query 2017-12-04, Query 2017-12-18 |
| Participants: |
| Description |
|
The new $project-$limit optimization in 3.2 can cause the pipeline to be split much earlier than before, because the pipeline is split at the $limit stage. I'm attaching two explain plans: one for a query that uses the optimization, and one for a query that does not because I added a $redact: $$KEEP just before the $limit. I think it would be good to take pipeline splitting into consideration when performing these optimizations. In addition, there is no $sort stage in this pipeline that would benefit from having the $limit moved up. Cheers, |
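To make the splitting concern concrete, here is a minimal Python sketch of a toy model. It is not the server's implementation. It assumes a hypothetical pipeline ($match, $project, $limit), assumes the pipeline is split at the first $limit, $skip, $sort, or $group stage, and assumes that a $limit at the split point is also applied on each shard. The names `swap_limit_before_project` and `split_for_shards` are illustrative only.

```python
# Toy model of the behavior described in this ticket; NOT the server's code.
# Assumption: the pipeline is split at the first of these stages.
SPLIT_STAGES = {"$limit", "$skip", "$sort", "$group"}


def stage_name(stage):
    """Return the operator name of a single-key stage document."""
    return next(iter(stage))


def swap_limit_before_project(pipeline):
    """Move each $limit ahead of any $project stages directly preceding it."""
    out = list(pipeline)
    i = 1
    while i < len(out):
        if stage_name(out[i]) == "$limit" and stage_name(out[i - 1]) == "$project":
            out[i - 1], out[i] = out[i], out[i - 1]
            i = max(i - 1, 1)  # re-check: the $limit may hop over another $project
        else:
            i += 1
    return out


def split_for_shards(pipeline):
    """Split into (shards_part, merger_part) at the first splitting stage."""
    for i, stage in enumerate(pipeline):
        if stage_name(stage) in SPLIT_STAGES:
            shards_part = list(pipeline[:i])
            if stage_name(stage) == "$limit":
                # Assumption: each shard can apply the same $limit locally.
                shards_part.append(stage)
            return shards_part, list(pipeline[i:])
    return list(pipeline), []


# Hypothetical pipelines; the reporter's actual pipeline is not shown here.
plain = [{"$match": {"status": "A"}}, {"$project": {"total": 1}}, {"$limit": 10}]
with_redact = [
    {"$match": {"status": "A"}},
    {"$project": {"total": 1}},
    {"$redact": "$$KEEP"},
    {"$limit": 10},
]

for label, pipeline in (("plain", plain), ("with $redact", with_redact)):
    shards_part, merger_part = split_for_shards(swap_limit_before_project(pipeline))
    print(label, "shards:", [stage_name(s) for s in shards_part])
    print(label, "merger:", [stage_name(s) for s in merger_part])
```

In this model the $project lands in the merger part for the plain pipeline, and stays in the shards part when the $redact stage blocks the swap.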
| Comments |
| Comment by Githook User [ 07/Dec/17 ] |
|
Author: {'name': 'jannaerin', 'username': 'jannaerin', 'email': 'golden.janna@gmail.com'} Message: |
| Comment by David Storch [ 12/Oct/17 ] |
|
tess.avitabile this sounds reasonable to me. I think for now we can move this out to 3.7 Desired, but this could be a good thing for janna.golden to work on after she has a little bit of ramp time on the query team. |
| Comment by Tess Avitabile (Inactive) [ 11/Oct/17 ] |
|
I like charlie.swanson's suggestion to have $sort look ahead in the pipeline, past stages that preserve the number of documents, for a $limit to coalesce with (where by ahead, I mean [{$sort: ...}, ..., {$limit: ...}]). There is no benefit to swapping $limit before $project except when the $limit can then find a $sort to coalesce with. And there is no harm in swapping $limit before $project when there is a $sort earlier in the pipeline, because the pipeline will be split at the $sort, so the $project would not be performed on the shards anyway. asya's suggestion to duplicate the $limit when there is an intervening stage that increases the number of documents seems like a good extension. I do not think we can say whether it is always better or worse to swap $skip before $project. On a single shard, it is clearly always better. But in a sharded cluster, it depends on the cost of the $project versus how much the $project reduces the document size. Since we cannot determine whether a swap is an improvement, and there are no reported issues about the current optimization, I recommend we leave it as is. |
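The look-ahead idea in the comment above can be sketched roughly as follows. This is a simplified Python illustration, not the server's code. The set of count-preserving stages and the `limit` annotation attached to the $sort stage are assumptions made for the example, and `coalesce_limit_into_sort` is a hypothetical name.

```python
# Sketch only. Assumption: these stages emit exactly one document per input.
PRESERVES_COUNT = {"$project", "$addFields"}


def stage_name(stage):
    """Return the operator name (first key) of a stage document."""
    return next(iter(stage))


def coalesce_limit_into_sort(pipeline):
    """Let each $sort absorb a later $limit if only count-preserving stages intervene."""
    out = [dict(stage) for stage in pipeline]  # shallow copies; input is not mutated
    i = 0
    while i < len(out):
        if stage_name(out[i]) == "$sort":
            j = i + 1
            # Look past stages that cannot change the number of documents.
            while j < len(out) and stage_name(out[j]) in PRESERVES_COUNT:
                j += 1
            if j < len(out) and stage_name(out[j]) == "$limit":
                # Hypothetical annotation: record the limit on the $sort stage.
                out[i]["limit"] = out.pop(j)["$limit"]
        i += 1
    return out


sorted_pipeline = [
    {"$sort": {"total": -1}},
    {"$project": {"total": 1}},
    {"$limit": 5},
]
unsorted_pipeline = [{"$project": {"total": 1}}, {"$limit": 5}]
blocked_pipeline = [{"$sort": {"total": -1}}, {"$unwind": "$items"}, {"$limit": 5}]

print(coalesce_limit_into_sort(sorted_pipeline))
print(coalesce_limit_into_sort(unsorted_pipeline))
print(coalesce_limit_into_sort(blocked_pipeline))
```

Because the $limit only moves when a $sort is present to absorb it, a pipeline without a $sort keeps its $project ahead of the $limit, which avoids the earlier split described in this ticket.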
| Comment by Charlie Swanson [ 09/Sep/16 ] |
|
I have one idea of how to fix this: The $skip optimization might suffer from a problem similar to the one described here, and I'm not sure if or how we want to address that. The $skip/$project swap was meant to reduce the amount of work done transforming documents within $project. I'm tempted to think that this is still a worthwhile optimization. If so, we'd want to add some special logic after splitting the pipeline to check whether the next stage or stages are a $project (or, again, something like $addFields). If there is at least one such stage, we can move it or them back to the part of the pipeline that runs in parallel on the shards. If we do that second piece of work, we might not need to do the first, since the same strategy would work for $limit. |
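The post-split step described above might look something like this Python sketch. It is an illustration under stated assumptions, not the server's logic: the pipeline has already been split into a shards part and a merger part, the merger part starts with the stage that forced the split, and only $skip and $limit are treated as safe to move projections across. The name `move_projections_to_shards` is hypothetical.

```python
# Sketch only. Assumption: these stages can safely run on the shards instead.
MOVABLE = {"$project", "$addFields"}
# Assumption: moving a projection ahead of these stages does not change results.
SAFE_SPLIT_STAGES = {"$skip", "$limit"}


def stage_name(stage):
    """Return the operator name of a single-key stage document."""
    return next(iter(stage))


def move_projections_to_shards(shards_part, merger_part):
    """Move $project-like stages that follow the split stage back to the shards."""
    if not merger_part or stage_name(merger_part[0]) not in SAFE_SPLIT_STAGES:
        return list(shards_part), list(merger_part)
    split_stage, rest = merger_part[0], list(merger_part[1:])
    moved = []
    # Take the leading run of movable stages right after the split stage.
    while rest and stage_name(rest[0]) in MOVABLE:
        moved.append(rest.pop(0))
    return list(shards_part) + moved, [split_stage] + rest


# Hypothetical split of [$match, $skip, $project, $group] after the $skip swap.
shards_part = [{"$match": {"status": "A"}}]
merger_part = [{"$skip": 5}, {"$project": {"total": 1}}, {"$group": {"_id": "$total"}}]

new_shards, new_merger = move_projections_to_shards(shards_part, merger_part)
print([stage_name(s) for s in new_shards])
print([stage_name(s) for s in new_merger])
```

The sketch deliberately leaves a $sort split point alone, since a projection moved ahead of a $sort could drop the sort key.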
| Comment by Ramon Fernandez Marina [ 11/Jul/16 ] |
|
Thanks for your reports antoine.hom@amadeus.com, both Regards, |
| Comment by Antoine Hom [ 11/Jul/16 ] |
|
The query without redact timed out in 10+ minutes in our cluster. (because of |