[SERVER-48169] Optimize distinct count to avoid materializing any values outside the index
Created: 12/May/20  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework, Performance
Affects Version/s: 4.4.0
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Charlie Swanson
Assignee: Backlog - Query Optimization
Resolution: Unresolved
Votes: 0
Labels: qopt-team
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-12159 Distinct Count support in Agg pipeline Closed
Assigned Teams:
Query Optimization
Participants:

Description

Today we have two separate optimized index scans: COUNT_SCAN and DISTINCT_SCAN. COUNT_SCAN returns simple sentinel values as it scans the index, which avoids the cost of translating the index key format into the format the query plan needs. DISTINCT_SCAN can seek over large sections of the index where the values are identical, but it still materializes an object outside the index key for consumption by the query plan. These two optimizations could be combined for a query like:

// Assume index {value1: 1, value2: 1, value3: 1} exists.
collection.aggregate([
    { $match: {
        value1: 1,
        value2: { $gte: new Date(1000) }
    }},
    { $group: { _id: "$value3" } },
    { $count: "distinct" } // field name here doesn't matter
])

This would likely improve performance, though it is unclear by how much.
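
To make the proposed combination concrete, here is a toy model in plain JavaScript. It is only a sketch under stated assumptions, not server code. The index is modeled as a sorted array of [value1, value2, value3] key tuples, and the names firstIndexWhere, compareKeys and countDistinctValue3 are hypothetical. The loop seeks past runs of identical keys (the DISTINCT_SCAN idea) and returns only a number, never a document built from the key (the COUNT_SCAN idea). The same value3 can recur under different value2 values, so the sketch still deduplicates in a set.

```javascript
// Toy model only: a sorted array of [value1, value2, value3] tuples stands in
// for the compound index {value1: 1, value2: 1, value3: 1}.

// Lexicographic comparison of two key tuples of equal length.
function compareKeys(a, b) {
  for (let i = 0; i < a.length; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return 0;
}

// Binary search for the first position >= lo whose key satisfies pred.
// pred must be false for a prefix of the range and true for the rest.
function firstIndexWhere(index, lo, pred) {
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (pred(index[mid])) hi = mid;
    else lo = mid + 1;
  }
  return lo;
}

// Counts distinct value3 among keys with value1 === v1 and value2 >= minV2.
function countDistinctValue3(index, v1, minV2) {
  const seen = new Set();
  // Seek to the start of the matching bounds.
  const start = [v1, minV2, -Infinity];
  let pos = firstIndexWhere(index, 0, (k) => compareKeys(k, start) >= 0);
  while (pos < index.length && index[pos][0] === v1) {
    const key = index[pos];
    seen.add(key[2]); // record the raw key component; no object is built
    // Seek past every entry with an identical key instead of stepping one by one.
    pos = firstIndexWhere(index, pos, (k) => compareKeys(k, key) > 0);
  }
  return seen.size; // only the count leaves the scan
}

const sampleIndex = [
  [0, 5, 9],
  [1, 500, 7], // value2 is below the bound used below, so it is excluded
  [1, 1000, 3],
  [1, 1000, 3],
  [1, 1000, 4],
  [1, 2000, 3],
  [1, 2000, 3],
  [1, 2000, 8],
  [2, 1000, 1],
];
console.log(countDistinctValue3(sampleIndex, 1, 1000)); // prints 3
```

In this model the saving grows with the number of duplicate keys, since each run of identical keys costs one seek rather than one step per entry. How that translates to the real storage engine is exactly the open question in this ticket.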

