[SERVER-27389] Use a covered index scan if $sample is the first stage and only the _id is needed Created: 12/Dec/16  Updated: 19/Jan/23  Resolved: 19/Jan/23

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Charlie Swanson Assignee: Backlog - Query Optimization
Resolution: Won't Do Votes: 0
Labels: neweng, optimization
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Query Optimization
Participants:

 Description   

The optimized code path that uses a random cursor to provide a $sample stage currently unconditionally appends a FETCH stage on top of the index scan: https://github.com/mongodb/mongo/blob/r3.4.0-rc2/src/mongo/db/pipeline/pipeline_d.cpp#L253-L254

This is unnecessary in cases like this where we only need the _id to answer the aggregation:

> db.foo.drop();
true
> for (var i = 0; i < 10000; i++) { db.foo.insert({_id: i}); }
WriteResult({ "nInserted" : 1 })
> db.foo.explain().aggregate([{$sample: {size: 10}}, {$bucketAuto: {groupBy: "$_id", buckets: 2}}])
{
	"stages" : [
		{
			"$cursor" : {
				"query" : {
					
				},
				"fields" : {
					"_id" : 1
				},
				"queryPlanner" : {
					"plannerVersion" : 1,
					"namespace" : "test.foo",
					"indexFilterSet" : false,
					"winningPlan" : {
						"stage" : "FETCH",  // This FETCH stage is not necessary.
						"inputStage" : {
							"stage" : "INDEX_ITERATOR"
						}
					},
					"rejectedPlans" : [ ]
				}
			}
		},
		{
			"$sampleFromRandomCursor" : {
				"size" : NumberLong(10)
			}
		},
		{
			"$bucketAuto" : {
				"groupBy" : "$_id",
				"buckets" : 2,
				"output" : {
					"count" : {
						"$sum" : {
							"$const" : 1
						}
					}
				}
			}
		}
	],
	"ok" : 1,
	"operationTime" : Timestamp(0, 0)
}

This optimization is valid if the Pipeline either has no dependencies, or if the only dependency is the _id. To check this, we'll need to move this dependency calculation up to before the handling of a $sample stage. If the only dependency is the _id, we'll need to add a PROJECTION stage to transform the index key into a full-blown document.



 Comments   
Comment by David Storch [ 15/Oct/18 ]

LGTM

Comment by Charlie Swanson [ 15/Oct/18 ]

asya david.storch I was just scanning the neweng tickets and came across this one. Given this is only relevant to MMAPv1, ok to close as "Won't Fix"?

Comment by Charlie Swanson [ 16/Mar/17 ]

On the MMAPv1 storage engine, yes. With WiredTiger, we always fetch since it uses a MULTI_ITERATOR stage, which fetches the full record from the storage engine. I believe this is necessary, since WiredTiger performs a random walk over the RecordId index, which is not the same as the _id, so we need to look up the _id somehow. There may be some optimization to avoid fetching the whole document in that case, but I'd imagine it would be much more complex.

Generated at Thu Feb 08 04:15:02 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.