Resharding uses $sample internally, which means it is using a WiredTiger (WT) random cursor. In a resharding performance test, the test occasionally fails because $sample repeatedly fails to find 100 unique documents.
In this ticket we should reproduce the failure, adding instrumentation to WT as needed. Once we understand the issue, we should find a way to make random cursors behave better in the problem case.
The problem test is the ReshardCollection.yml genny workload. It inserts 100,000 10KB documents split evenly across two shards. It then reshards the collection while 100 threads perform reads and writes (find and update commands). Resharding tries to get ~200 samples from each shard via $sample. Occasionally, a sample includes duplicate keys. We see an error when 100 consecutive attempts to get ~200 unique keys all fail.
$sample is allowed to return duplicate keys. But given the number of keys and size of the sample, having this happen repeatedly is surprising and undesirable.
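To put a rough number on "surprising": the sketch below estimates how often duplicates would appear if the sampling were ideal. It assumes each shard holds 50,000 keys (100,000 split evenly across two shards), that each of the ~200 samples is an independent, uniformly random draw with replacement, and that the 100 attempts are independent of each other. These are idealizations for illustration only; they are not a description of how the WT random cursor or the resharding sampling code actually behaves, and the function names are hypothetical.

```python
def prob_all_unique(num_keys: int, sample_size: int) -> float:
    """Probability that sample_size uniform draws (with replacement)
    from num_keys keys are all distinct (the birthday-problem product)."""
    p = 1.0
    for i in range(sample_size):
        # The (i+1)-th draw must avoid the i keys already drawn.
        p *= (num_keys - i) / num_keys
    return p


def prob_duplicate(num_keys: int, sample_size: int) -> float:
    """Probability that at least one key appears more than once."""
    return 1.0 - prob_all_unique(num_keys, sample_size)


def prob_all_attempts_fail(num_keys: int, sample_size: int, attempts: int) -> float:
    """Probability that every one of `attempts` independent samples
    contains at least one duplicate key."""
    return prob_duplicate(num_keys, sample_size) ** attempts


if __name__ == "__main__":
    keys_per_shard = 50_000  # assumed: 100,000 documents over two shards
    sample_size = 200        # approximate sample size per shard
    attempts = 100           # consecutive failures needed to see the error

    print("P(duplicate in one sample):", prob_duplicate(keys_per_shard, sample_size))
    print("P(all attempts fail):",
          prob_all_attempts_fail(keys_per_shard, sample_size, attempts))
```

Under these assumptions a single sample containing a duplicate is not rare (roughly one attempt in three), but 100 consecutive failures would be astronomically unlikely. If that model is even approximately right, seeing the error at all suggests the samples are far from uniform or far from independent in the problem case.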