The $sample stage returns a different sample every time it runs. $lookup re-runs the inner pipeline for each outer document in some cases, and in others runs it only once and reuses the result. As a result, the behavior of $sample inside $lookup is hard to predict.
For example, this query runs the sub-pipeline only once, so every outer document gets the same sample:
{$lookup: {
    from: 'foreign_coll',
    pipeline: [
        {$sample: {size: 5}},
    ],
    as: 'docs',
}}
On the other hand, this query re-runs the sub-pipeline, choosing a different sample per outer document:
{$lookup: {
    from: 'foreign_coll',
    let: {outer: "$_id"},
    pipeline: [
        {$match: {$expr: {$lt: ["$_id", "$$outer"]}}}, // correlation predicate
        {$sample: {size: 3}},
    ],
    as: 'docs',
}}
Because we consider DocumentSourceSequentialDocumentCache to be an optimization rather than guaranteed behavior, the rule illustrated by the two queries (an uncorrelated sub-pipeline runs once, a correlated one re-runs) can have other exceptions. For example, if you add a dummy correlation hoping to force the inner pipeline to re-run, it can get optimized out.
This ticket will change the server so that any sub-pipeline with a $sample stage, or with a stage containing a $rand or $sampleRate expression, is ineligible for uncorrelated pipeline caching.
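As a rough illustration of the proposed eligibility rule, the check could be sketched in plain JavaScript as below. This is not server code: containsNonDeterministicOperator and NON_DETERMINISTIC_OPERATORS are hypothetical names, and the sketch assumes that scanning object keys is enough to find the three operators. The real change would live in the server's pipeline analysis.

```javascript
// Hypothetical helper, not part of the server: reports whether a pipeline
// contains $sample, $rand, or $sampleRate as a key at any nesting depth.
const NON_DETERMINISTIC_OPERATORS = new Set(['$sample', '$rand', '$sampleRate']);

function containsNonDeterministicOperator(value) {
    if (Array.isArray(value)) {
        // Pipelines and expression argument lists are arrays: check each element.
        return value.some(containsNonDeterministicOperator);
    }
    if (value !== null && typeof value === 'object') {
        // Stages and expressions are objects: check the key, then recurse into the value.
        return Object.entries(value).some(
            ([key, child]) =>
                NON_DETERMINISTIC_OPERATORS.has(key) ||
                containsNonDeterministicOperator(child)
        );
    }
    // Strings, numbers, booleans, and null cannot contain an operator.
    return false;
}

// Example sub-pipelines (collection and field names are illustrative only).
const sampleOnly = [{$sample: {size: 5}}];
const randInsideStage = [{$addFields: {r: {$rand: {}}}}];
const deterministic = [{$match: {status: 'A'}}];

console.log(containsNonDeterministicOperator(sampleOnly));      // true
console.log(containsNonDeterministicOperator(randInsideStage)); // true
console.log(containsNonDeterministicOperator(deterministic));   // false
```

Under this sketch, a sub-pipeline for which the helper returns true would be excluded from caching and re-run for every outer document. Because the scan recurses through every value, it also flags operators inside a nested $lookup sub-pipeline.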