[SERVER-69257] COUNT_SCAN plan not selected when unfiltered $group + $sum on _id performed Created: 30/Aug/22  Updated: 31/Oct/23

Status: Blocked
Project: Core Server
Component/s: Query Execution
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Alex Bevilacqua Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File SERVER-69257-bench.js    
Issue Links:
Depends
depends on SERVER-23406 index scan is slower than full collec... Backlog
Problem/Incident
causes RUBY-3064 countDocuments() executes aggregation... Closed
Related
is related to SERVER-57518 16% performance loss switching from C... Backlog
is related to SERVER-29444 Use a covered and streaming plan for ... Backlog
Assigned Teams:
Query Optimization
Operating System: ALL
Sprint: QO 2022-09-19, QO 2023-08-07, QO 2023-08-21
Participants:

 Description   

As of DRIVERS-501 the driver CRUD specification defines a pipeline to be used to count documents without resorting to the count command (due to potential inaccurate counts without query predicates).

db.foo.drop();
db.foo.insertMany([{},{},{},{},{}]);
 
// Driver CRUD spec definition for countDocuments ////////////////////
filter = {}
skip   = null
limit  = null
 
pipeline = [{'$match': filter}]
if (skip) {
  pipeline.push({'$skip': skip})
}
if (limit) {
  pipeline.push({'$limit': limit})
}
pipeline.push({'$group': {'_id': 1, 'n': {'$sum': 1}}})
//////////////////////////////////////////////////////////////////////
 
db.foo.explain("executionStats").aggregate(pipeline)

Without specifying a filter the empty $match stage would result in a COLLSCAN plan winning, though a COUNT_SCAN should be appropriate.

/*
"nReturned": 5,
"executionTimeMillis": 0,
"totalKeysExamined": 0,
"totalDocsExamined": 5,
"executionStages": {
  "stage": "COLLSCAN",
*/

This could be worked around by unshifting a {$sort: { _id: 1 }} stage to the pipeline, however after discussing with david.storch@mongodb.com this may be addressable server-side.



 Comments   
Comment by Peter Volk [ 15/Aug/23 ]

Marking this as blocked by SERVER-23406 as the current performance numbers don't allow this patch to be merged as long as the index scans are significantly slower than a col scan. 

Comment by Milena Ivanova [ 07/Aug/23 ]

Local benchmark shows that for collections with a relatively small number of fields the collection scan is more efficient than the COUNT_SCAN over the _id index. Bellow are results for collections of size 10000, 100000, and 1 000 000. The documents have 6 fields {_id: i, a: i * 5, b: 100, c: 1000, d: "xxx", e: "abc".repeat(100) }. Each test was run 7 times and the first and last measurements were removed when computing the average. The following results were obtained when running with the SBE engine.

Running for collection size 10000
Testing scan for hint { "$natural" : 1 }
Average execution time (ms): 3 Median execution time (ms): 3
Testing scan for hint { "_id" : 1 }
Average execution time (ms): 5 Median execution time (ms): 5
Testing scan for hint { "a" : 1 }
Average execution time (ms): 4 Median execution time (ms): 4
 
 
Running for collection size 100000
Testing scan for hint { "$natural" : 1 }
Average execution time (ms): 29.6 Median execution time (ms): 30
Testing scan for hint { "_id" : 1 }
Average execution time (ms): 47 Median execution time (ms): 47
Testing scan for hint { "a" : 1 }
Average execution time (ms): 42 Median execution time (ms): 42
 
 
Running for collection size 1000000
Testing scan for hint { "$natural" : 1 }
Average execution time (ms): 304.6 Median execution time (ms): 305
Testing scan for hint { "_id" : 1 }
Average execution time (ms): 470.6 Median execution time (ms): 468
Testing scan for hint { "a" : 1 }
Average execution time (ms): 415.2 Median execution time (ms): 415 

Similar results for collections with 3 fields, one idea faster over the large collection sizes.

Comment by Charlie Swanson [ 06/Sep/22 ]

This does seem to be pretty easily addressed. This is a pretty hack-y patch I drummed up which seems to "fix" it and prefer a full COUNT_SCAN of the "_id" index over a COLLSCAN:

diff --git a/src/mongo/db/query/query_planner.cpp b/src/mongo/db/query/query_planner.cpp
index ba398f446cb..4e23e3aa435 100644
--- a/src/mongo/db/query/query_planner.cpp
+++ b/src/mongo/db/query/query_planner.cpp
@@ -1494,6 +1494,17 @@ StatusWith<std::vector<std::unique_ptr<QuerySolution>>> QueryPlanner::plan(
         }
     }
 
+    // If there are no indexed alternatives, but this is a count request without a query predicate,
+    // we can use a COUNT_SCAN of the entire _id index (if it exists).
+    if (out.empty() && (params.options & QueryPlannerParams::IS_COUNT) &&
+        query.getQueryObj().isEmpty()) {
+        for (auto&& index : params.indices) {
+            if (isIdIndex(index.keyPattern)) {
+                out.push_back(buildWholeIXSoln(index, query, params));
+            }
+        }
+    }
+
     // Check whether we're eligible to use the columnar index, assuming no other indexes can be
     // used.
     if (out.empty()) {

I'll nominate this for a quick win. It won't be super easy since I believe we'll want to benchmark it and possibly consider whether a non-id index should be eligible? Only non-partial, non-multi key indexes would work. I'm not sure how we'd effectively choose amongst them though, so the "_id" index may be the sensible default.

Comment by Charlie Swanson [ 06/Sep/22 ]

Oops! Disregard my last comment. I confused myself by looking at the example in that linked ticket, which does not include a filter or a field path access:

db.items.aggregate([ {$group: {_id: category, count: {$sum: 1}}}])

It was later clarified that this was a typo though and "category" is meant to be a field path reference:

db.items.aggregate([ {$group: {_id: "$category", count: {$sum: 1}}}])

That's a pretty different query than the pure count in this ticket, so I will re-open and undo the duplication. I'll also fix up the example on that ticket

Comment by Charlie Swanson [ 06/Sep/22 ]

Thanks christopher.harris@mongodb.com and nicholas.zolnierz@mongodb.com. This does appear to be a duplicate of SERVER-29444. The empty predicate is the same as no $match, and the two are both impacted by this bug of not generating an entire-index COUNT_SCAN. The ticket christopher.harris@mongodb.com linked is worth keeping in mind for any change like this, but I think described a filtered COUNT, which makes it eligible for index scans.

Generated at Thu Feb 08 06:13:00 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.