[SERVER-22573] $sample should stream results when possible Created: 11/Feb/16 Updated: 11/Feb/16 Resolved: 11/Feb/16 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | 3.2.1 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Matt Kangas | Assignee: | Charlie Swanson |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Participants: | |
| Description |
Please confirm that the `$sample` aggregation operator streams results to the client immediately whenever possible. If not, consider this a request for improvement. For example:
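Something along the following lines, where `events` is a hypothetical collection that is much larger than the requested sample (the size of 5 matches the example discussed in the comments):

```js
// Hypothetical collection; any collection much larger than the requested
// sample size would exercise the same code path.
db.events.aggregate([
  { $sample: { size: 5 } }
])
```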
Here the sampling implementation will use a WiredTiger random cursor to obtain each document. I wish to confirm that the first sampled document identified is available to the client before the full sample set is computed. Please confirm this for each of the sampling mechanisms implemented in MongoDB 3.2 (with and without a query predicate), and also for sharded-cluster behavior.

Use case: MongoDB Compass. We want to permit users to specify large sample sizes in Compass and then progressively display the results. Ideally, Compass can begin to display an inferred schema after only a small number of documents have been received from the server. |
| Comments |
| Comment by Charlie Swanson [ 11/Feb/16 ] |
I'm closing this as a duplicate of |
| Comment by Charlie Swanson [ 11/Feb/16 ] |
I can confirm. When we choose to use the random cursor approach, it will be non-blocking. In your example, the sample of size 5 will use the random cursor. You can confirm this using explain:
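For instance (the collection name is illustrative, and the exact shape of the explain output can vary by version), look for a `$sampleFromRandomCursor` stage in the reported pipeline:

```js
// Explain the same pipeline; when the random-cursor optimization is chosen,
// the reported stages include $sampleFromRandomCursor instead of a plain $sample.
db.events.aggregate(
  [ { $sample: { size: 5 } } ],
  { explain: true }
)
```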
This will always be true if your sample is small enough relative to the collection size (here's the code that decides if we should try a random cursor). The $sampleFromRandomCursor stage is indeed non-blocking, and will return results to the next stage as soon as they are ready.

In a sharded cluster, this stage is split in two: one part performed on each shard, and one performed on the merging shard. The parts on each shard still use the same logic to decide whether to use a random cursor, depending on how many documents are on that shard. Each shard's stage outputs documents sorted by an injected value (I won't go into the details of how that's computed), and the stage on the merging shard simply merges the pre-sorted streams, which is also non-blocking.

It should be noted that this doesn't necessarily mean the results will be returned to the client immediately. We still fill up one batch before sending anything back in response, and any blocking stage later in the pipeline (a $sort or a $group) will block until all documents have been produced. |
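As an illustration of that last point (the collection name, sample size, and batch size here are arbitrary), a client such as Compass that wants sampled documents as early as possible can keep blocking stages out of the pipeline after $sample and request a small initial batch:

```js
// No $sort/$group after $sample, so documents stream straight through, and a
// small first batch means the server can respond as soon as a few sampled
// documents are available rather than waiting to fill a default-sized batch.
db.events.aggregate(
  [ { $sample: { size: 1000 } } ],
  { cursor: { batchSize: 10 } }
)
```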
| Comment by Neville Dipale [ 11/Feb/16 ] |
Does this work only when $sample is the sole stage in the pipeline, or is streaming still supported when there are other downstream stages in the pipeline? |