[SERVER-9507] Optimize $sort+$group+$first pipeline to avoid full index scan Created: 29/Apr/13  Updated: 07/Sep/22  Resolved: 26/Sep/18

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 2.4.3
Fix Version/s: 4.1.4

Type: Improvement Priority: Major - P3
Reporter: Backlog - Query Team (Inactive) Assignee: Justin Seyster
Resolution: Done Votes: 9
Labels: 4.1.3, asya, mock-pm, optimization, performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-36517 Allow allPaths indexes to provide DIS... Closed
Documented
is documented by DOCS-13065 Investigate changes in SERVER-9507: O... Closed
Duplicate
is duplicated by SERVER-9272 Querying latest document based on a s... Closed
is duplicated by SERVER-31269 Too many documents examined when usin... Closed
Related
related to SERVER-37459 $group with $$ROOT returns error Closed
related to SERVER-4507 aggregation: optimize $group to take... Backlog
related to SERVER-23732 Aggregation should optimize an irrele... Backlog
related to SERVER-28980 aggregation can subsume $sort into $g... Backlog
related to SERVER-37715 Use DISTINCT_SCAN for $unwind-$group ... Backlog
related to SERVER-37304 Extend $sort+$group+$first pipeline o... Closed
related to SERVER-40090 DISTINCT_SCAN in agg is only used whe... Closed
related to SERVER-55576 Optimize queries on time-series colle... Closed
is related to SERVER-69359 Aggregate query bails on DISTINCT_SCA... Closed
is related to SERVER-2130 Ability to use Limit() with Distinct() Backlog
is related to SERVER-27915 Make $group with $addToSet accumulato... Backlog
is related to SERVER-2094 distinct cheat with indexes Closed
is related to SERVER-15291 slow '$group' performance Closed
is related to SERVER-29244 CLONE - distinct cheat with indexes Closed
Backwards Compatibility: Fully Compatible
Sprint: Query 2018-06-04, Query 2018-08-13, Query 2018-08-27, Query 2018-09-10, Query 2018-09-24, Query 2018-10-08
Participants:
Case:

 Description   

This is an analogue to SERVER-2094 ("distinct cheat with indexes"), but for the aggregation framework.

This performance improvement is to allow $group operators like $first to be able to take advantage of the fact that the input to the pipeline is sorted, and thus reduce the number of index entries scanned by "skipping" processing of large portions of the pipeline.

For example, suppose a user has a collection with an index {x:1,y:1}, and that x has low cardinality. Consider the following pipeline:

db.foo.aggregate({$sort:{x:1,y:1}},{$group:{_id:{x:"$x"},y:{$first:"$y"}}})

Currently, the above pipeline will perform a full scan of the index. After this optimization, the above pipeline will only have to scan on the order of |x| index entries, which is much smaller than the size of the index.

This ticket is filed as a result of discussion in SERVER-9272 (full use case available there).



 Comments   
Comment by Githook User [ 26/Sep/18 ]

Author:

{'name': 'Justin Seyster', 'email': 'justin.seyster@mongodb.com', 'username': 'jseyster'}

Message: SERVER-9507 Optimize $sort+$group+$first pipeline to use DISTINCT_SCAN
Branch: master
https://github.com/mongodb/mongo/commit/da63195cc421f8f29c1c0bef5fa2c2226d230dfd

Comment by Ian Whalen (Inactive) [ 20/Sep/18 ]

Target date: definitely end of this sprint (8 weeks).
"even with build baron" - Justin

Comment by Ian Whalen (Inactive) [ 06/Sep/18 ]

Target date of: end of this sprint. (6 weeks)

Comment by J Rassi [ 02/May/13 ]

Correct, the aggregation framework currently cannot use an index to help optimize those pipelines (which is unrelated to this ticket). If an index cannot be used to satisfy a $sort, then an in-memory sort is performed, in which case all documents in the pipeline have to be examined anyway (so no significant performance improvement can be made). If an index can be used to satisfy a $sort, and only a small subset of documents are needed by a later pipeline stage (in a way that the sort order can be employed), then the optimization suggested here will drastically reduce the number of index entries scanned.

Comment by Mervin San Andres [ 30/Apr/13 ]

What if, prior to this set of operators, I would have to perform other pipeline operators such as $project and $unwind? As far as I know, indexes can no longer be used after the transformation.

Generated at Thu Feb 08 03:20:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.