[SERVER-27637] Merging phase of a distributed $sample should recognize input streams are already sorted Created: 11/Jan/17  Updated: 14/Aug/17  Resolved: 14/Aug/17

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Charlie Swanson
Assignee: Bernard Gorman
Resolution: Duplicate
Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-22760 Sharded aggregation pipelines which i... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Query 2017-08-21
Participants:

 Description   

When an aggregation on a sharded collection starts with a $sample of size N, we split the $sample into two parts: the first gathers a sample of size N on each shard, and the second merges those samples into a final sample, which may include documents from every shard. The first part can be achieved by either (1) doing a full sort of all documents based on injected random values and then taking the top N, or (2) doing repeated random cursor walks over an index until we get N unique documents. In approach (2) we inject random values into the documents after the fact, in such a way that the output documents are still in decreasing order of random value.
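To make the two per-shard approaches concrete, here is a minimal Python sketch. It is not the server's implementation: the function names are hypothetical, documents are modeled as plain Python values, and the tagging scheme shown for approach (2) (draw N random values, sort them in decreasing order, pair them with the already-chosen documents) is only one assumed way to satisfy the "decreasing order of random value" property described above.

```python
import heapq
import random


def sample_by_full_sort(docs, n, rng=random):
    """Approach (1): tag every document with a random value, keep the top n.

    Returns (rand_val, doc) pairs in decreasing order of rand_val.
    """
    tagged = [(rng.random(), doc) for doc in docs]
    # heapq.nlargest returns its results sorted from largest to smallest key.
    return heapq.nlargest(n, tagged, key=lambda pair: pair[0])


def tag_after_the_fact(sampled_docs, rng=random):
    """Approach (2), assumed scheme: the documents were already picked by
    random index walks, so we only attach random values afterwards, arranged
    so that the output is in decreasing order of random value.

    sampled_docs must be a list (it is traversed twice).
    """
    rand_vals = sorted((rng.random() for _ in sampled_docs), reverse=True)
    return list(zip(rand_vals, sampled_docs))
```

Either function yields a stream that is already sorted by the random value, which is the property the merging phase can rely on.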

In either case, the documents output from each shard will be ordered by a random metadata field and just need to be merged in a "merge sorted streams" style. This makes the merging half of the $sample stage equivalent to the merging half of a $sort stage, so when splitting a $sample stage we generate a {$sample: {size: N}} to run on all the shards, and a {$sort: {sortKey: {$computed0: {$meta: "randVal"}}}} to run on the merging shard. This $sort stage should also include the "mergingPresorted" option, so that it can take advantage of the fact that the inputs are already sorted and avoid the need to spill to disk.
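The "merge sorted streams" idea can be illustrated with a short Python sketch. This is an illustration only, not the server's merging code: merge_presorted_samples is a hypothetical name, and each shard's output is modeled as a list of (rand_val, doc) pairs already in decreasing order of rand_val. Because the inputs are presorted, the merge only needs to hold one candidate per stream at a time rather than buffering and re-sorting everything.

```python
import heapq
from itertools import islice


def merge_presorted_samples(shard_streams, n):
    """Merge per-shard streams, each already in decreasing rand_val order,
    and keep only the first n results.

    heapq.merge is lazy: it keeps one pending item per input stream, so no
    full sort (and no spilling) is needed.
    """
    merged = heapq.merge(*shard_streams, key=lambda pair: pair[0], reverse=True)
    return list(islice(merged, n))


# Usage: two shards, each with a presorted sample of size 3.
shard_a = [(0.9, "a1"), (0.4, "a2"), (0.1, "a3")]
shard_b = [(0.8, "b1"), (0.7, "b2"), (0.2, "b3")]
final_sample = merge_presorted_samples([shard_a, shard_b], 3)
# final_sample == [(0.9, "a1"), (0.8, "b1"), (0.7, "b2")]
```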

The net impact is that today a $sample issued against a sharded collection can fail with an error telling the user to pass 'allowDiskUse: true' to perform the sort on the merging shard, even though no disk use is required. Taking advantage of the already-sorted streams should also speed up all distributed $sample operations.



 Comments   
Comment by Bernard Gorman [ 08/Aug/17 ]

This issue was resolved with this commit as part of the work on SERVER-22760.
