[SERVER-45474] $sample doesn't support variables or expression operators Created: 10/Jan/20  Updated: 23/Jun/23

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 4.2.2
Fix Version/s: None

Type: Improvement Priority: Minor - P4
Reporter: Luke Prochazka Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 3
Labels: qopt-team
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-38802 Add more granular options to $sample Backlog
Assigned Teams:
Query Optimization
Participants:
Case:

 Description   

The current $sample stage only accepts a number for the size parameter, per: https://github.com/mongodb/mongo/blob/master/src/mongo/db/pipeline/document_source_sample.cpp#L102

It would be nice to allow variables and expressions here, in the same way that they are accepted almost universally elsewhere within aggregation.



 Comments   
Comment by Luke Prochazka [ 04/Oct/21 ]

asya, the use case I had in mind was the ability to dynamically supply the sample size parameter as derived from the collection size, i.e. to sample a proportion of the collection rather than a fixed number of documents irrespective of the collection size. This makes more statistical sense and offers more control over the storage engine behaviour.

Here is a sample shell procedure for illustration purposes:
 

// $sample expression testing
var dbName = 'database',
    collName = 'collection',
    sampleSize = 4.9; // percentage
var options = {
    "allowDiskUse": true,
    "cursor": {},
    "readConcern": { "level": "local" },
    "comment": "$sample expression testing",
    "let": { // Added in MongoDB 5.0
        "sampleN": ((sampleSize/100) * db.getSiblingDB(dbName).getCollection(collName).stats().count)|0
    }
};
var pipeline = [{
    // "$sample": { "size": 1000 } // factory default works as expected
    // "$sample": { "size": "$$sampleN" } // fails
    "$sample": { "size": { "$expr": "$$sampleN" } } // fails
}];
 
db.getSiblingDB(dbName).getCollection(collName).aggregate(pipeline, options);

Resulting in the message: "MongoServerError: size argument to $sample must be a number"

The reproduction script above uses the new v5.0 let aggregation option, which postdates this Jira's initial request but provides a simpler and more elegant example for illustration purposes. A pre-5.0 version would require a more convoluted pipeline, using an uncorrelated $lookup subquery to $collStats or a similar technique to push the derived document count ahead of the initial $sample.

Having the ability to perform the above would for example go a long way to obviating the need for SERVER-38802.
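
Until something like this is supported, one possible client-side workaround is to compute the size before building the pipeline and pass it to $sample as a literal number. The sketch below is only an illustration under stated assumptions: proportionalSampleSize and buildSamplePipeline are hypothetical helper names, not part of MongoDB or the shell, and the result is clamped to at least 1 because the documentation describes size as a positive integer.

```javascript
// Hypothetical helper (not a MongoDB API): turns a percentage of the
// collection into the literal integer that $sample currently requires.
function proportionalSampleSize(docCount, percent) {
    if (!Number.isFinite(docCount) || docCount < 0) {
        throw new RangeError("docCount must be a non-negative number");
    }
    if (!Number.isFinite(percent) || percent <= 0 || percent > 100) {
        throw new RangeError("percent must be in the range (0, 100]");
    }
    // Multiply before dividing to limit floating-point drift, then truncate.
    // Clamp to 1 so the stage never receives a size of 0.
    return Math.max(1, Math.floor((percent * docCount) / 100));
}

// Hypothetical helper: builds a pipeline whose $sample size is a plain
// number, which is the only form the stage accepts today.
function buildSamplePipeline(docCount, percent) {
    return [{ "$sample": { "size": proportionalSampleSize(docCount, percent) } }];
}

// Usage in the shell (assumes dbName and collName as in the script above):
//   var coll = db.getSiblingDB(dbName).getCollection(collName);
//   coll.aggregate(buildSamplePipeline(coll.estimatedDocumentCount(), 4.9));
```

This costs an extra round trip for the count, and the count can change before the aggregation runs, so it does not replace server-side support for expressions in $sample.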
 

Comment by Asya Kamsky [ 25/May/21 ]

luke.prochazka

$sample applies to the entire incoming stream of documents, so what expression would it be useful to use here?

Generated at Thu Feb 08 05:08:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.