[DOCS-15221] Indicate that ONLY specific aggregation stages an utilize indexes Created: 08/Apr/22  Updated: 30/Oct/23  Due: 02/Sep/22  Resolved: 20/Oct/22

Status: Closed
Project: Documentation
Component/s: manual, Server
Affects Version/s: None
Fix Version/s: Server_Docs_20231030

Type: Improvement Priority: Major - P3
Reporter: Alex Bevilacqua Assignee: Nick Villahermosa
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:
Days since reply: 1 year, 8 weeks, 2 days ago
Epic Link: DOCSP-11701

 Description   

In https://www.mongodb.com/docs/manual/core/aggregation-pipeline-optimization/#std-label-aggregation-pipeline-optimization-indexes-and-filters the instructions are not explicit, and the resulting ambiguity can confuse users who are unsure as to why their pipeline didn't use an index.

The query planner analyzes an aggregation pipeline to determine if indexes can be used to improve pipeline performance.

The following list shows some pipeline stages that can use indexes:

This should be reworded as:

The query planner analyzes an aggregation pipeline to determine if indexes from the collection targeted by the aggregate command can be used to improve pipeline performance. Only the first pipeline stage can benefit from indexes, and only if that stage is one of the following:



 Comments   
Comment by Githook User [ 12/Dec/22 ]

Author:

{'name': 'Nick Villahermosa', 'email': 'nick.villahermosa@mongodb.com', 'username': 'nvillahermosa-mdb'}

Message: DOCS-15221 updated with initial copy and tech review input (#1974)
Branch: indexes-refactor
https://github.com/10gen/docs-mongodb-internal/commit/74287114ce4e7ce43c0a7d3bb16b0c677ddfbf17

Comment by Githook User [ 18/Oct/22 ]

Author:

{'name': 'Nick Villahermosa', 'email': 'nick.villahermosa@mongodb.com', 'username': 'nvillahermosa-mdb'}

Message: DOCS-15221 updated with initial copy and tech review input (#1974) (#2025)
Branch: v4.2
https://github.com/10gen/docs-mongodb-internal/commit/54f56d634cb745b03a5ac3a4b507ed80c5a6f9d1

Comment by Githook User [ 18/Oct/22 ]

Author:

{'name': 'Nick Villahermosa', 'email': 'nick.villahermosa@mongodb.com', 'username': 'nvillahermosa-mdb'}

Message: DOCS-15221 updated with initial copy and tech review input (#1974) (#2023)
Branch: v5.0
https://github.com/10gen/docs-mongodb-internal/commit/db01fd5161c10f8c1511ffee02d2418453fe74bf

Comment by Githook User [ 18/Oct/22 ]

Author:

{'name': 'Nick Villahermosa', 'email': 'nick.villahermosa@mongodb.com', 'username': 'nvillahermosa-mdb'}

Message: DOCS-15221 updated with initial copy and tech review input (#1974) (#2022)
Branch: v6.0
https://github.com/10gen/docs-mongodb-internal/commit/7b72372e813eb67a43073d7bc373ecba11a1bcde

Comment by Githook User [ 11/Oct/22 ]

Author:

{'name': 'Nick Villahermosa', 'email': 'nick.villahermosa@mongodb.com', 'username': 'nvillahermosa-mdb'}

Message: DOCS-15221 updated with initial copy and tech review input (#1974)
Branch: master
https://github.com/10gen/docs-mongodb-internal/commit/74287114ce4e7ce43c0a7d3bb16b0c677ddfbf17

Comment by Nick Villahermosa [ 08/Sep/22 ]

Thanks christopher.harris@mongodb.com and alex.bevilacqua@mongodb.com for diving into this. I'll incorporate the suggested changes and submit another PR once that's done.

Comment by Chris Harris [ 08/Sep/22 ]

To clear up the "not only the first stage can benefit" piece, perhaps wording along these lines?

The leading stages in an aggregation pipeline can benefit from indexes, but usage of an index requires that the first stage (after optimizations) is one of the following:

And then to help address the fact that other stages later in the pipeline can also use indexes, perhaps append some additional text afterwards along the lines of:

Furthermore, other aggregation stages that retrieve new data from collections later in the pipeline are also eligible for index usage.  These include:

  • $lookup
  • ...
Comment by Alex Bevilacqua [ 08/Sep/22 ]

christopher.harris@mongodb.com, how would you recommend adjusting the documentation? The information shared is extremely useful, however based on the nuances you've described it might be best if there were something actionable that could be applied to the current proposed changes in https://github.com/10gen/docs-mongodb-internal/pull/1774/files

Comment by Chris Harris [ 07/Sep/22 ]

Hmm, this section of the documentation might be particularly tricky to get right. 

To attempt to help clear up some of the initial confusion here - the clarification that index eligibility is predicated on the first stage of the pipeline is not a claim that using an index can only benefit the first stage of the pipeline. So the suggested text in the description of "Only the first pipeline stage can benefit from indexes" is not correct. A simple example of this could be a pipeline that consists of [ $match, $sort, $group ]. If an appropriate index is present, then it could be used to do all of the following:

  • Efficiently identify the relevant data for the $match stage
  • Find the data in sorted order for the $sort stage
  • Return all of the field values needed to execute the $group stage (eg a covered query)

So generally speaking, index eligibility is determined by the first stage of the pipeline but usage of the index can benefit subsequent stages. What we're probably trying to clarify here is that a user can't, for example, perform some transformation of their data in the first stages of the pipeline and then expect the database to be able to use an index to service a $match or $sort that is found at the end of their pipeline.

There's a bunch of nuance here though when it comes to "first" and when it comes to explicitly listing the stages. The two things that come to mind are:

  • As Alex (and the target docs page itself as it just so happens) mentioned - the database will reorder and coalesce stages. So the first stage of the pipeline when the database checks for index eligibility may be different than what the user actually wrote/executed. This is specifically done to allow the system to execute aggregations more efficiently (which often means attempting to use an index) so it's nothing to be worried or concerned about, but it certainly changes what "first" might mean.
  • It is strictly not true that there are never subsequent pipeline stages that can separately use an index. The 'obvious' ones here are $lookup, $graphLookup, and $unionWith. All of those will definitely attempt to use indexes AFAIK, and both the current documentation and the changes currently suggested in this ticket don't address that. I'd recommend trying to do something here. (From an architectural/implementation perspective, these aggregation stages represent the "first" part of a new data source retrieval in the overall aggregation pipeline, so that word kind of still applies here.  But that's way too much of a nuanced/implementation detail to surface in the docs I think).
Comment by Nick Villahermosa [ 07/Sep/22 ]

The second statement is about $group using an index for optimization. If the $sort comes first, doesn't that invalidate "$group can potentially use an index..."?

Regarding reordering, I don't see $group mentioned in the linked article, and it looks like $sort is only moved in relation to $match, not $group. You'd still have:

  1. $sort the fields to group by
  2. $group the fields, using an index to optimize

Just trying to get this right.

Comment by Alex Bevilacqua [ 07/Sep/22 ]

$sort can use an index if $sort is not preceded by a $project$unwind, or $group stage.

I think the current guidance is misleading. The initial $sort CAN utilize an index if it is the FIRST stage of the pipeline, which this ticket is trying to clarify. I believe part of the current documentation might stem from the pipeline optimizations that can reorder stages under certain circumstances.

$group can potentially use an index to find the first document in each group if: $group is preceded by $sort that sorts the field to group by, and...

 

The sentence reads "if $group is preceded by $sort", so in this case the $sort would still come first

christopher.harris@mongodb.com can you give https://github.com/10gen/docs-mongodb-internal/pull/1774/files a quick look to make sure the appropriate messaging is being provided via this ticket?

Comment by Nick Villahermosa [ 07/Sep/22 ]

alex.bevilacqua@mongodb.com, need a quick tech review: "only the first pipeline stage" is at odds with the documented behavior for the $group and $sort stages. Is that documentation incorrect? I haven't yet gone through enough of the tech training to rely on my own testing for this.

> $sort can use an index if $sort is not preceded by a $project, $unwind, or $group stage.

(Indicates it can be preceded by another type of stage, and wouldn't be first)

> $group can potentially use an index to find the first document in each group if: $group is preceded by $sort that sorts the field to group by, and...

(specifically requires that it isn't the first stage)

Generated at Thu Feb 08 08:12:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.