DOCS-13065

Investigate changes in SERVER-9507: Optimize $sort+$group+$first pipeline to avoid full index scan

      Description

      Downstream Change Summary

      The section titled "Pipeline Operators and Indexes" at https://docs.mongodb.com/manual/core/aggregation-pipeline/#pipeline-operators-and-indexes should be updated due to this change. It currently lists $match, $sort, and $geoNear as the stages that can use an index. However, this list is not exhaustive. Due to this change, a pipeline with a $group at the beginning can use an index, even if there is no $sort stage.

      In addition to documenting this new behavior, I suggest that we change the language of this page so that it does not claim to be exhaustive. That is, it should not say "Stages X, Y, and Z can result in index use." Instead, it should say something like "MongoDB's query planner analyzes an aggregation pipeline in order to determine whether indexes can be used to accelerate the operation. For example, an index can be used for filtering if a $match is at the beginning of the pipeline, or can be moved to the beginning of the pipeline by the optimizer. Similarly, a $sort at the beginning of the pipeline can be computed by scanning an index in order. As a final example, $group stages which obtain the distinct values of a field can use an index for the distinct operation if they occur at the beginning of the pipeline."
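      To make the three cases in the suggested wording concrete, here is a minimal sketch of what such pipelines might look like. It assumes the collection foo and the index {x:1,y:1} from the linked ticket; the filter value, the $limit stage, and the firstStage helper are hypothetical and only there for illustration. Whether the planner actually uses the index depends on the server version and the data, so the explain output is the place to check.

```javascript
// Assumed setup (from the linked ticket): collection `foo`, index { x: 1, y: 1 }.

// 1. $match at the start of the pipeline: the index may be used for filtering.
//    The value 5 is an arbitrary, hypothetical filter value.
const matchFirst = [
  { $match: { x: 5 } },
  { $group: { _id: "$y", n: { $sum: 1 } } },
];

// 2. $sort at the start of the pipeline: the sort may be satisfied by
//    scanning the index in order. The $limit is illustrative only.
const sortFirst = [{ $sort: { x: 1, y: 1 } }, { $limit: 10 }];

// 3. $group at the start of the pipeline that only obtains the distinct
//    values of x: after SERVER-9507 this may use the index, with no $sort.
const groupDistinct = [{ $group: { _id: "$x" } }];

// Hypothetical helper: returns the name of the first stage in a pipeline.
function firstStage(pipeline) {
  return Object.keys(pipeline[0])[0];
}

console.log([matchFirst, sortFirst, groupDistinct].map(firstStage));

// In the shell, against a real deployment, the plan could be inspected with:
//   db.foo.createIndex({ x: 1, y: 1 });
//   db.foo.explain().aggregate(groupDistinct);
```

      The pipelines are plain arrays, so the same objects can be passed to db.foo.aggregate() or to db.foo.explain().aggregate() unchanged.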

      Description of Linked Ticket

      This is an analogue to SERVER-2094 ("distinct cheat with indexes"), but for the aggregation framework.

      This performance improvement allows $group accumulators such as $first to take advantage of the fact that the input to the pipeline is sorted, and thus to reduce the number of index entries scanned by skipping over large portions of the input.

      For example, suppose a user has a collection with an index {x:1,y:1}, and that x has low cardinality. Consider the following pipeline:

      db.foo.aggregate([{$sort:{x:1,y:1}},{$group:{_id:{x:"$x"},y:{$first:"$y"}}}])

      Currently, the above pipeline performs a full scan of the index. After this optimization, it will only have to scan on the order of |x| index entries, where |x| is the number of distinct values of x. Because x has low cardinality, this is far fewer than the total number of entries in the index.
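      The difference between the two scan strategies can be sketched with a small in-memory model. This is not the server's implementation: the sorted array standing in for the {x:1,y:1} index, the function names, and the binary-search seek are all assumptions made for illustration. The sketch only counts how many index entries each strategy reads to compute the first y for every x.

```javascript
// Hypothetical model of an index on { x: 1, y: 1 }: an array of [x, y]
// keys kept in sorted order (x ascending, then y ascending).
function buildIndex(docs) {
  return docs
    .map((d) => [d.x, d.y])
    .sort((a, b) => a[0] - b[0] || a[1] - b[1]);
}

// Full scan: read every entry and keep the first y seen for each x.
function firstYFullScan(index) {
  const result = new Map();
  let examined = 0;
  for (const [x, y] of index) {
    examined++;
    if (!result.has(x)) result.set(x, y);
  }
  return { result, examined };
}

// Binary search for the first position whose x is strictly greater than
// the given x. This stands in for an index seek and is not counted as
// reading an entry.
function seekPastX(index, x) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (index[mid][0] <= x) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Skip scan: read one entry per distinct x, then seek past that x.
function firstYSkipScan(index) {
  const result = new Map();
  let examined = 0;
  let pos = 0;
  while (pos < index.length) {
    const [x, y] = index[pos];
    examined++;
    result.set(x, y); // first entry for this x holds the smallest y
    pos = seekPastX(index, x);
  }
  return { result, examined };
}

// Low-cardinality x: 3 distinct values, 1000 entries each.
const docs = [];
for (let x = 0; x < 3; x++) {
  for (let y = 0; y < 1000; y++) docs.push({ x, y });
}
const index = buildIndex(docs);
const full = firstYFullScan(index);
const skip = firstYSkipScan(index);
console.log(full.examined, skip.examined); // 3000 3
```

      Both strategies return the same result; only the number of entries read differs, and in this model it matches the |x| estimate given above.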

      This ticket is filed as a result of discussion in SERVER-9272 (full use case available there).

      Scope of changes

      Impact to Other Docs

      MVP (Work and Date)

      Resources (Scope or Design Docs, Invision, etc.)

            Assignee:
            jeffrey.allen@mongodb.com Jeffrey Allen
            Reporter:
            backlog-server-pm Backlog - Core Eng Program Management Team
            Votes:
            0
            Watchers:
            6

              Created:
              Updated:
              Resolved:
              4 years, 30 weeks ago