[SERVER-22573] $sample should stream results when possible Created: 11/Feb/16  Updated: 11/Feb/16  Resolved: 11/Feb/16

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 3.2.1
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Matt Kangas
Assignee: Charlie Swanson
Resolution: Duplicate
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-533 Aggregation stage to randomly sample ... Closed
Related
Participants:

 Description   

Please confirm that the `$sample` aggregation operator streams results to the client immediately whenever possible. If not, consider this a request for improvement.

For example:

MongoDB Enterprise > db.serverStatus().storageEngine
{ "name" : "wiredTiger", "supportsCommittedReads" : true }
MongoDB Enterprise > db.serverBuildInfo().version
3.2.1
MongoDB Enterprise > db.users.count()
10000
MongoDB Enterprise > db.users.aggregate([{$sample:{size: 5}}])

Here the sampling implementation will use a WiredTiger random cursor to obtain each document. I wish to confirm that the first sample document identified is available to the client before the full sample set is computed.

Please confirm this for each of the sampling mechanisms implemented in MongoDB 3.2 (with and without a query predicate), and describe the behavior in a sharded cluster.

Use case: MongoDB Compass. We want to permit users to specify large sample sizes in Compass, then progressively display the results to users. Ideally Compass can begin to display an inferred schema after a very small number of documents are received from the server.
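
To make the use case concrete, here is a minimal client-side sketch of progressive consumption. It is not Compass code: the function name `inferFieldsProgressively`, the `every` parameter, and the `onUpdate` callback are all hypothetical, and "schema" is reduced to the set of top-level field names. It only assumes a cursor-like object with `hasNext()` and `next()`, which is what the shell's aggregation cursor provides.

```javascript
// Hypothetical sketch: read a cursor one document at a time and report the
// top-level field names seen so far after every `every` documents.
function inferFieldsProgressively(cursor, every, onUpdate) {
    var seen = {};
    var count = 0;

    while (cursor.hasNext()) {
        var doc = cursor.next();
        Object.keys(doc).forEach(function (key) {
            seen[key] = true;
        });
        count++;

        // Report progress without waiting for the full sample.
        if (count % every === 0) {
            onUpdate(Object.keys(seen).sort(), count);
        }
    }

    // Report once more if the last chunk was smaller than `every`.
    if (count % every !== 0) {
        onUpdate(Object.keys(seen).sort(), count);
    }
    return Object.keys(seen).sort();
}
```

In the shell this could be called as `inferFieldsProgressively(db.users.aggregate([{$sample: {size: 1000}}]), 10, printjson)`. How soon the first update fires still depends on when the server returns its first batch.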



 Comments   
Comment by Charlie Swanson [ 11/Feb/16 ]

I'm closing this as a duplicate of SERVER-533, since there is no additional work here. Feel free to re-open or just comment more if there's still more to discuss.

Comment by Charlie Swanson [ 11/Feb/16 ]

I can confirm. When we choose to use the random cursor approach, it will be non-blocking. In your example, the sample of size 5 will use the random cursor. You can confirm this using explain:

> db.users.count()
10000
> db.users.explain().aggregate([{$sample: {size: 5}}])
{
	"waitedMS" : NumberLong(0),
	"stages" : [
		{
			"$cursor" : {
				"query" : {
					
				},
				"queryPlanner" : {
					"plannerVersion" : 1,
					"namespace" : "test.users",
					"indexFilterSet" : false,
					"winningPlan" : {
						"stage" : "FETCH",
						"inputStage" : {
							"stage" : "INDEX_ITERATOR"
						}
					},
					"rejectedPlans" : [ ]
				}
			}
		},
		{
			"$sampleFromRandomCursor" : {  // This means it's using a random cursor, whereas $sample would mean it's using a random sort (blocking).
				"size" : NumberLong(5)
			}
		}
	],
	"ok" : 1
}
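
The same check can be scripted. The helper below is a small sketch, assuming the unsharded 3.2 explain shape shown above (a top-level `stages` array); the name `usesRandomCursor` is made up for illustration, and explain output from a sharded cluster or another server version may be shaped differently.

```javascript
// Hypothetical helper: returns true if the explain output contains a
// $sampleFromRandomCursor stage, false otherwise (for example when the
// pipeline fell back to a plain, blocking $sample stage).
function usesRandomCursor(explainOutput) {
    var stages = explainOutput.stages || [];
    return stages.some(function (stage) {
        return Object.prototype.hasOwnProperty.call(stage, "$sampleFromRandomCursor");
    });
}
```

For example: `usesRandomCursor(db.users.explain().aggregate([{$sample: {size: 5}}]))`.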

The random cursor will always be used if your sample is small enough relative to the collection size (here's the code that decides if we should try a random cursor). The $sampleFromRandomCursor stage is indeed non-blocking, and will return results to the next stage as soon as they are ready.

In a sharded cluster, this stage will be split in two: one part performed on each shard, and one performed on the merging shard. The parts performed on each shard will still use the same logic to determine whether they should use a random cursor, depending on how many documents are on that shard. Each of them will output documents that are sorted by an injected value (I won't go into the details of how that's computed). The part on the merging shard will simply merge the pre-sorted streams, which is also non-blocking.
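
To illustrate why merging pre-sorted streams does not block, here is a rough sketch of a lazy merge. It is not the server's implementation: the field name `sortKey`, the ascending order, and the helper names are assumptions made for the example. The point is only that the merger needs one buffered document per input stream before it can emit its first result.

```javascript
// Hypothetical sketch: lazily merge several pre-sorted streams.
// Each stream is an object with hasNext() and next(), yielding documents
// that carry a numeric `sortKey` field (an assumed name, ascending order).
function mergeSortedStreams(streams) {
    // One-document lookahead buffer per stream.
    var heads = streams.map(function (stream) {
        return stream.hasNext() ? stream.next() : null;
    });

    return {
        hasNext: function () {
            return heads.some(function (head) { return head !== null; });
        },
        next: function () {
            var best = -1;
            for (var i = 0; i < heads.length; i++) {
                if (heads[i] !== null &&
                    (best === -1 || heads[i].sortKey < heads[best].sortKey)) {
                    best = i;
                }
            }
            if (best === -1) {
                throw new Error("no more documents");
            }
            var doc = heads[best];
            // Refill only the stream that was just consumed from.
            heads[best] = streams[best].hasNext() ? streams[best].next() : null;
            return doc;
        }
    };
}

// Helper for trying the sketch without a server: wraps an array as a stream.
function arrayStream(docs) {
    var index = 0;
    return {
        hasNext: function () { return index < docs.length; },
        next: function () { return docs[index++]; }
    };
}
```

Each call to `next()` reads at most one further document from a single input, so output can begin long before any input stream is exhausted.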

It should be noted that this doesn't necessarily mean the results will be returned to the client immediately. We will still fill up one batch before sending any back in response, and if you have any other blocking stage later in the pipeline (a $sort or a $group), those will block until all documents have been produced.
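
Since a full batch is filled before anything is sent back, a client that wants the first documents sooner can ask for a smaller batch. The sketch below builds an `aggregate` command document with the `cursor.batchSize` option; the function name `buildSampleCommand` and its parameters are hypothetical, and the right batch size for a given client is an open tuning question.

```javascript
// Hypothetical helper: build an aggregate command that samples `sampleSize`
// documents and asks the server for batches of at most `batchSize` documents.
function buildSampleCommand(collectionName, sampleSize, batchSize) {
    return {
        aggregate: collectionName,
        pipeline: [{$sample: {size: sampleSize}}],
        cursor: {batchSize: batchSize}
    };
}
```

In the shell, `db.runCommand(buildSampleCommand("users", 1000, 10))` should return the first documents under `cursor.firstBatch`. The shell helper form `db.users.aggregate([{$sample: {size: 1000}}], {cursor: {batchSize: 10}})` passes the same option.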

Comment by Neville Dipale [ 11/Feb/16 ]

Would this work when $sample is the only operation in the pipeline, or would streaming be supported if there are other downstream operations in the pipeline?
