The $sample stage returns a different sample every time it runs. $lookup sometimes re-runs the inner pipeline per outer document, and sometimes runs it only once. This makes the behavior of $sample inside $lookup hard to predict.
For example, this query runs the sub-pipeline only once, so every outer document receives the same sample:
{$lookup: {
    from: 'foreign_coll',
    pipeline: [
        {$sample: {size: 5}},
    ],
    as: 'docs',
}}
On the other hand, this query re-runs the sub-pipeline, choosing a different sample per outer document:
{$lookup: {
    from: 'foreign_coll',
    let: {outer: "$_id"},
    pipeline: [
        {$match: {$expr: {$lt: ["$_id", "$$outer"]}}},  // correlation predicate
        {$sample: {size: 3}},
    ],
    as: 'docs',
}}
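The difference between the two cases above can be illustrated with a toy model in plain JavaScript. This is only a sketch: `sampleStage` and `toyLookup` are hypothetical names, the `cacheable` flag stands in for whatever decision the server makes, and nothing here reflects the actual implementation of $lookup or DocumentSourceSequentialDocumentCache.

```javascript
// Toy model only: this is NOT how the server implements $lookup.
// sampleStage stands in for {$sample: {size: n}}.
function sampleStage(docs, size) {
  const pool = docs.slice();
  // Partial Fisher-Yates shuffle: move `size` random elements to the front.
  for (let i = 0; i < Math.min(size, pool.length); i++) {
    const j = i + Math.floor(Math.random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, size);
}

// toyLookup evaluates `subPipeline` against `foreign` for each outer document.
// When `cacheable` is true the first result is reused (mimicking the
// uncorrelated-pipeline cache); otherwise the sub-pipeline re-runs each time.
function toyLookup(outerDocs, foreign, subPipeline, cacheable) {
  let cached = null;
  let runs = 0;
  const results = outerDocs.map((outer) => {
    if (!cacheable || cached === null) {
      cached = subPipeline(foreign, outer);
      runs++;
    }
    return { ...outer, docs: cached };
  });
  return { results, runs };
}

const foreign = Array.from({ length: 100 }, (_, i) => ({ _id: i }));
const outerDocs = [{ _id: 10 }, { _id: 50 }, { _id: 90 }];

// Uncorrelated: ignores the outer document, so the model caches it.
const uncorrelated = toyLookup(outerDocs, foreign, (f) => sampleStage(f, 5), true);

// Correlated: filters on the outer _id first, so the model re-runs it.
const correlated = toyLookup(
  outerDocs,
  foreign,
  (f, outer) => sampleStage(f.filter((d) => d._id < outer._id), 3),
  false
);

console.log(uncorrelated.runs, correlated.runs); // prints "1 3"
```

In the cached case every outer document shares one sample, while in the re-run case each outer document gets an independently drawn one.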
Since we consider DocumentSourceSequentialDocumentCache to be an optimization, there could be other exceptions to the rule that uncorrelated sub-pipelines run once and correlated ones re-run. For example, if you add a dummy correlation hoping to force the inner pipeline to re-run, it can get optimized out.
This ticket will change the server to treat any $sample stage, or any stage containing a $rand or $sampleRate expression, as ineligible for uncorrelated pipeline caching.
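As a rough illustration of the proposed rule, the check could be sketched as a recursive scan of the sub-pipeline for the three operators named above. The helper names `usesRandomness` and `containsRandomOperator` are hypothetical, and the sketch covers only the randomness condition; it does not model correlation analysis or any other caching criteria in the server.

```javascript
// Hypothetical sketch; names and logic are illustrative, not the server's.
const RANDOM_OPERATORS = new Set(["$sample", "$rand", "$sampleRate"]);

// Recursively reports whether `value` contains any key listed above.
function containsRandomOperator(value) {
  if (Array.isArray(value)) {
    return value.some(containsRandomOperator);
  }
  if (value !== null && typeof value === "object") {
    return Object.entries(value).some(
      ([key, child]) => RANDOM_OPERATORS.has(key) || containsRandomOperator(child)
    );
  }
  return false;
}

// A sub-pipeline that uses randomness would be ineligible for
// uncorrelated pipeline caching under the rule described above.
function usesRandomness(subPipeline) {
  return subPipeline.some(containsRandomOperator);
}

const sampled = usesRandomness([{ $sample: { size: 5 } }]);
const randExpr = usesRandomness([{ $match: { $expr: { $lt: [{ $rand: {} }, 0.5] } } }]);
const sampleRate = usesRandomness([{ $match: { $sampleRate: 0.3 } }]);
const plain = usesRandomness([{ $match: { status: "A" } }]);

console.log(sampled, randExpr, sampleRate, plain); // prints "true true true false"
```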