[SERVER-78764] Revisit sampling CE scan-from-start in performance variants Created: 07/Jul/23  Updated: 19/Jan/24  Resolved: 19/Jan/24

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Alya Berciu Assignee: Backlog - Query Optimization
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on WT-11815 Implement repeatable random cursor Open
depends on SERVER-82887 [CQF] Make sampling in chunks the sam... Closed
Assigned Teams:
Query Optimization
Participants:

 Description   

To reduce noise in the sampling variants, our performance tests scan from the start of the collection rather than taking a true random sample. We should revisit this approach.



 Comments   
Comment by David Percy [ 19/Jan/24 ]

It sounds like when we set up these tests, we decided to use sequential sampling because:

  • We wanted to minimize noise in the tests, and we saw that random sampling made them noisier.
  • Given that the test data is random (with no correlation between value and position), the estimates we get from sequential sampling vs random sampling will be similar.
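
The two points above can be illustrated with a minimal Python sketch. It is not the server's sampling CE code: the collection, the predicate, the sample size, and all function names are hypothetical, and the data is generated so that value and position are uncorrelated, as the comment assumes for the test data.

```python
import random


def estimate_selectivity(sample, predicate):
    """Fraction of sampled values that satisfy the predicate."""
    return sum(1 for value in sample if predicate(value)) / len(sample)


def sequential_sample(collection, n):
    """Scan-from-start sample: the first n values in storage order."""
    return collection[:n]


def random_sample(collection, n, rng):
    """True random sample of n values, drawn without replacement."""
    return rng.sample(collection, n)


def predicate(value):
    # Hypothetical predicate; its true selectivity is about 0.25.
    return value < 250


# Hypothetical test data: uniformly random values, so a value says nothing
# about where it sits in the collection.
data_rng = random.Random(42)
collection = [data_rng.randrange(1000) for _ in range(100_000)]

sample_size = 1000

# Sequential sampling reads the same documents on every run.
seq_estimates = [
    estimate_selectivity(sequential_sample(collection, sample_size), predicate)
    for _ in range(5)
]

# Unseeded random sampling reads different documents on every run.
rand_estimates = [
    estimate_selectivity(
        random_sample(collection, sample_size, random.Random()), predicate
    )
    for _ in range(5)
]

print("sequential:", seq_estimates)
print("random:    ", rand_estimates)
```

Under these assumptions, the sequential estimates are identical across runs, while the random estimates vary from run to run but stay close to the same selectivity. If values were correlated with position (for example, data inserted in sorted order), the sequential estimate could be badly biased, which is why the argument depends on the test data being random.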

We considered random sampling with a fixed seed, but random sampling uses a WiredTiger random cursor, which doesn't support repeatability or a fixed seed (see WT-11815). Repeatability also requires more than a fixed seed, because the result depends on other factors as well, such as the tree shape.
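
To see why a fixed seed alone may not give repeatable samples, here is a toy Python model. It is not how WiredTiger's random cursor is implemented: the one-level "tree", the leaf sizes, the seed, and the function names are all assumptions made for illustration. The model picks a random leaf page and then a random key within it, so the same seed over the same keys can return different samples when the keys are split into pages differently.

```python
import random


def build_leaves(keys, leaf_size):
    """Split sorted keys into fixed-size leaf pages (a toy one-level tree)."""
    return [keys[i:i + leaf_size] for i in range(0, len(keys), leaf_size)]


def seeded_random_walk(leaves, n, seed):
    """Draw n keys: pick a random leaf, then a random key within that leaf."""
    rng = random.Random(seed)
    return [rng.choice(rng.choice(leaves)) for _ in range(n)]


keys = list(range(16))

# Same keys, two different page layouts.
small_pages = build_leaves(keys, 4)   # 4 leaves of 4 keys
large_pages = build_leaves(keys, 8)   # 2 leaves of 8 keys

# Same seed and same layout: the sample repeats.
first_run = seeded_random_walk(small_pages, 100, seed=7)
second_run = seeded_random_walk(small_pages, 100, seed=7)

# Same seed but a different layout: the sample changes.
other_shape = seeded_random_walk(large_pages, 100, seed=7)

print("repeatable with same shape:", first_run == second_run)
print("repeatable across shapes:  ", first_run == other_shape)
```

In this model, a repeatable sample needs both the seed and the page layout to be fixed. In a real storage engine the layout can change with splits, merges, and eviction, which is consistent with the comment's point that a fixed seed is not sufficient.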
