[SERVER-68623] Executing $sample from mongos on sharded collection has unexpected behavior Created: 08/Aug/22 Updated: 28/Sep/22 Resolved: 28/Sep/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Distributed Query Execution |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | beat jean | Assignee: | Denis Grebennicov |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Operating System: | ALL |
| Sprint: | QE 2022-09-19, QE 2022-11-14 |
| Participants: |
| Description |
|
According to https://www.mongodb.com/docs/manual/reference/operator/aggregation/sample/, $sample behaves differently depending on the parameters passed: it uses either a random sort or a random cursor. However, the behavior described in the docs only applies when $sample is executed on a mongod. Executing $sample from mongos against a sharded collection does not behave as described in the docs.
The problem is as follows. Execute $sample from mongos with a sample size under 5% of the total number of documents in a sharded collection. The random cursor method is expected, but in fact the random sort method is used to do the sampling on the shard server.
|
| Comments |
| Comment by Sebastien Mendez [ 28/Sep/22 ] |
|
We confirm that there is no bug to fix, so we're closing this SERVER ticket. Regards, |
| Comment by Denis Grebennicov [ 19/Sep/22 ] |
|
You are right, beatjean1314@gmail.com. I guess the documentation should be adjusted to say that the decision about which sampling strategy to pick is made from the local information of each corresponding shard. |
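
The per-shard decision discussed in this ticket can be illustrated with a small sketch. It is a simplified model, not the server's code: the thresholds (more than 100 documents, sample size below 5% of the documents) are taken from the $sample documentation page linked in the description, and the names `uses_random_cursor`, `strategy`, and `sample_sharded` are hypothetical helpers introduced only for this illustration. The merge step is modeled as a shuffle followed by a limit, following the description in the 14/Sep/22 comment.

```python
import random

# Thresholds as stated in the $sample documentation (assumed here,
# not read from the server source).
MIN_COLLECTION_SIZE = 100   # collection must contain MORE than this many documents
MAX_SAMPLE_RATIO = 0.05     # sample size must be BELOW this fraction of the documents


def uses_random_cursor(num_records, sample_size, is_first_stage=True):
    """Return True if the documented random-cursor conditions all hold."""
    return (
        is_first_stage
        and num_records > MIN_COLLECTION_SIZE
        and sample_size < MAX_SAMPLE_RATIO * num_records
    )


def strategy(num_records, sample_size):
    """Name the sampling strategy for a given local record count."""
    if uses_random_cursor(num_records, sample_size):
        return "random cursor"
    return "random sort"


def sample_sharded(shards, sample_size, rng):
    """Model of the sharded flow: each shard samples locally, then the
    merge step randomly orders the partial results and applies the limit.

    `shards` maps a shard name to the list of documents stored on it.
    """
    partial = []
    for docs in shards.values():
        # Whichever strategy a shard picks, it returns at most
        # sample_size randomly chosen local documents.
        partial.extend(rng.sample(docs, min(sample_size, len(docs))))
    rng.shuffle(partial)          # merge step: random ordering
    return partial[:sample_size]  # merge step: limit to $sample.size


# The example from this ticket: 151 documents, 101 on shard A, 50 on shard B.
shard_counts = {"A": 101, "B": 50}
per_shard = {name: strategy(n, 5) for name, n in shard_counts.items()}
cluster_wide = strategy(sum(shard_counts.values()), 5)
print(per_shard, cluster_wide)
```

Evaluating the conditions against the cluster-wide count of 151 suggests a random cursor, which is what the reporter expected. Evaluating them against each shard's local count gives a random cursor on shard A but a random sort on shard B, because shard B holds no more than 100 documents.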
| Comment by beat jean [ 18/Sep/22 ] |
|
Yes, I mean that for mongos, numberOfRecords is 151 and the sample size is 5. According to the docs at https://www.mongodb.com/docs/manual/reference/operator/aggregation/sample/, all three conditions are met, so the random cursor should be used for sampling, but shard B still uses random sorting. |
| Comment by Denis Grebennicov [ 14/Sep/22 ] |
|
I just wrote a sample test to exercise different setups and explain outputs for the $sample stage in a sharded cluster with sharded collections. Before executing the $sample stage, every shard checks the conditions for the random cursor. The numberOfRecords value used is the number of records stored on that particular mongod instance, not the total number of records on all shards. This means that if we have 151 documents, where 101 are stored on shard A and 50 are stored on shard B, then shard A will sample with the help of a random cursor, while shard B will read all documents from the collection and perform random sorting. Regardless of the work performed on each shard, the merge step will randomly sort the results from both shards and keep only (limit) the number of documents specified in $sample.size. When the collection is not sharded, the $sample stage behaves as it would in a regular replica-set/standalone setup. Does this answer the question, chris.kelly@mongodb.com beatjean1314@gmail.com? Here you can see the explain output:
|