[DOCS-13065] Investigate changes in SERVER-9507: Optimize $sort+$group+$first pipeline to avoid full index scan Created: 01/Oct/19  Updated: 13/Nov/23  Resolved: 13/Jan/20

Status: Closed
Project: Documentation
Component/s: manual
Affects Version/s: None
Fix Version/s: 4.1.4, Server_Docs_20231030, Server_Docs_20231106, Server_Docs_20231105, Server_Docs_20231113

Type: Task
Priority: Major - P3
Reporter: Backlog - Core Eng Program Management Team
Assignee: Jeffrey Allen
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
documents SERVER-9507 Optimize $sort+$group+$first pipeline... Closed
Related
is related to SERVER-69359 Aggregate query bails on DISTINCT_SCA... Closed
Participants:
Days since reply: 4 years, 19 weeks, 1 day ago
Epic Link: DOCS: 4.2 Server/Tools
Story Points: 2

Description

Downstream Change Summary

The section titled "Pipeline Operators and Indexes" at https://docs.mongodb.com/manual/core/aggregation-pipeline/#pipeline-operators-and-indexes should be updated because of this change. It currently lists $match, $sort, and $geoNear as the stages that can use an index. However, this list is not exhaustive. As a result of this change, a pipeline with a $group at the beginning can use an index, even if there is no $sort stage.

In addition to documenting this new behavior, I suggest that we change the language of this page so that it does not claim to be exhaustive. That is, it should not say "Stages X, Y, and Z can result in index use." Instead, it should say something like "MongoDB's query planner analyzes an aggregation pipeline in order to determine whether indexes can be used to accelerate the operation. For example, an index can be used for filtering if a $match is at the beginning of the pipeline, or can be moved to the beginning of the pipeline by the optimizer. Similarly, a $sort at the beginning of the pipeline can be computed by scanning an index in order. As a final example, $group stages which obtain the distinct values of a field can use an index for the distinct operation if they occur at the beginning of the pipeline."
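
The three cases in the suggested wording could be illustrated with pipelines along the following lines. This is only a sketch: the collection name foo and the index {x: 1, y: 1} are borrowed from the linked ticket's example, while the filter value, the $limit stage, and the leadingStage helper are assumptions made for illustration. Whether a given pipeline actually uses an index depends on the server version and the planner.

```javascript
// Hypothetical pipelines for a collection "foo" with an index on {x: 1, y: 1}.

// $match at the beginning: the index can be used for filtering.
const matchFirst = [
  { $match: { x: 5 } }, // the value 5 is an arbitrary example
  { $group: { _id: "$y", count: { $sum: 1 } } },
];

// $sort at the beginning: the sort can be computed by scanning the index in order.
const sortFirst = [
  { $sort: { x: 1, y: 1 } },
  { $limit: 10 },
];

// $group at the beginning that only obtains the distinct values of x:
// per the change described above, this can use the index without a $sort.
const distinctGroup = [{ $group: { _id: "$x" } }];

// Illustrative helper (not a MongoDB API): name of a pipeline's first stage.
const leadingStage = (pipeline) => Object.keys(pipeline[0])[0];

// In the shell, the chosen plan for each pipeline could be inspected with:
//   db.foo.explain().aggregate(distinctGroup)
```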

Description of Linked Ticket

This is an analogue to SERVER-2094 ("distinct cheat with indexes"), but for the aggregation framework.

This performance improvement allows $group accumulators such as $first to take advantage of the fact that the input to the pipeline is sorted, and thus to reduce the number of index entries scanned by "skipping" over large portions of the input.

For example, suppose a user has a collection with an index {x:1,y:1}, and that x has low cardinality. Consider the following pipeline:

db.foo.aggregate([{$sort: {x: 1, y: 1}}, {$group: {_id: {x: "$x"}, y: {$first: "$y"}}}])

Currently, the above pipeline performs a full scan of the index. After this optimization, it will only have to scan on the order of |x| index entries, where |x| is the number of distinct values of x. This is much smaller than the total number of entries in the index.
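
The difference between the two scans can be sketched with a small simulation in plain JavaScript. This is not MongoDB's implementation: the sorted array stands in for the {x: 1, y: 1} index, a binary search stands in for a B-tree seek, and all function names are hypothetical.

```javascript
// Simulated index on {x: 1, y: 1}: an array of [x, y] keys in sorted order.
function buildIndex(docs) {
  return docs
    .map((d) => [d.x, d.y])
    .sort((a, b) => a[0] - b[0] || a[1] - b[1]);
}

// Full scan: visit every index entry, keeping the first y seen for each x.
function firstPerGroupFullScan(index) {
  const result = new Map();
  let examined = 0;
  for (const [x, y] of index) {
    examined++;
    if (!result.has(x)) result.set(x, y);
  }
  return { result, examined };
}

// Skip scan: read the first entry of a group, then seek directly to the
// first entry with a larger x. The binary-search probes model tree
// navigation and are not counted as examined entries.
function firstPerGroupSkipScan(index) {
  const result = new Map();
  let examined = 0;
  let pos = 0;
  while (pos < index.length) {
    const [x, y] = index[pos];
    examined++;
    result.set(x, y);
    // Find the smallest position after pos whose x is greater than x.
    let lo = pos + 1;
    let hi = index.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (index[mid][0] > x) hi = mid;
      else lo = mid + 1;
    }
    pos = lo;
  }
  return { result, examined };
}

// Low-cardinality x: 3 distinct values, 1000 documents each.
const docs = [];
for (let x = 0; x < 3; x++) {
  for (let y = 0; y < 1000; y++) docs.push({ x, y: 1000 - y });
}
const index = buildIndex(docs);
const full = firstPerGroupFullScan(index);
const skip = firstPerGroupSkipScan(index);
console.log(full.examined, skip.examined); // prints "3000 3"
```

Both functions return the same first y per x; only the number of entries read differs, which is the saving the ticket describes for low-cardinality x.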

This ticket is filed as a result of discussion in SERVER-9272 (full use case available there).

Scope of changes

Impact to Other Docs

MVP (Work and Date)

Resources (Scope or Design Docs, Invision, etc.)


Generated at Thu Feb 08 08:06:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.