[SERVER-53626] Minimize index scanning when retrieving distinct values grouped by more than one field Created: 06/Jan/21  Updated: 21/Jan/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Chris Harris Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 1
Labels: indexv3
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-21992 Inconsistent results when grouping by... Backlog
depends on SERVER-14832 Distinct command result set may inclu... Needs Scheduling
depends on SERVER-60977 Make $group _id behavior with null an... Closed
Duplicate
is duplicated by SERVER-56294 Extend group to distinct optimization... Closed
Related
related to SERVER-40090 DISTINCT_SCAN in agg is only used whe... Closed
Assigned Teams:
Query Optimization
Sprint: QO 2021-10-04, QO 2021-10-18
Participants:

 Description   

MongoDB 4.2 introduced the ability to avoid full index scans for aggregation pipelines that include a specific type of $group stage - those that logically request distinct information which can be obtained by scanning a single document.  This was initially implemented via SERVER-9507.

As confirmed by this comment from a related enhancement in SERVER-40090, this optimization currently only applies when the aggregation is grouping on a single field.  We should consider expanding this optimization, or a similar one, to reduce index scanning when the _id specified for the $group is on multiple fields.    



 Comments   
Comment by Katherine Wu (Inactive) [ 26/Oct/21 ]

The proposed change in SERVER-21992 is different from SERVER-60977, and would not help with this ticket. Unfortunately this optimization will be blocked SERVER-14832, or on a new index format that distinguishes between null and missing. Moving back to open.

WIP patch: https://github.com/10gen/mongo/tree/kaywux/SERVER-53626

Comment by Katherine Wu (Inactive) [ 25/Oct/21 ]

The distinct command (and our current version of the index) conflates the null and missing values. However, $group with a mult-field document _id currently distinguishes between these two values. Without SERVER-21992, this optimization would give us different results.

Comment by Ruslan Abdulkhalikov (Inactive) [ 23/Apr/21 ]

As part of this ticket, it would be nice to handle a case of a key based on an array of a single index key:

> db.coll1.explain().aggregate({$group: {_id: ["$a"]}})
... "stage" : "COLLSCAN" ...

the case with an object works fine:

> db.coll1.explain().aggregate({$group: {_id: {fieldA: "$a"}}})
... "stage" : "DISTINCT_SCAN"

Generated at Thu Feb 08 05:31:26 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.