Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-11815

Implement repeatable random cursor

    • Type: Icon: Improvement Improvement
    • Resolution: Unresolved
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Labels:
    • StorEng - Refinement Pipeline

      It would be useful for some server testing to be able have reproducible behavior from random cursors. I.e., to run the same test on the same data and have a random cursor return the same results.

      One particular use case is for cardinality estimation. As I understand it, query planning sometimes wants to estimate the number (or fraction) of entries in a collection that match some predicate. So they do this by sampling the collection with a random cursor and estimating based on the fraction of the sample that matches the predicate.

      Implementing this is harder than just providing the same random seed to different random cursors. The behavior of a random cursor depends on both the random numbers and the state of the tree – the number of inserted values (which changes dynamically), the shape of the tree, and which pages are in memory and which are on disk.

      For the proposed use case, it sounds like it would be OK to assume the tree isn't changing. When using a repeatable random cursor, we could ensure that we make the same decisions regardless of which pages are cached.

      The hard part would be to ensure that we have the same tree shape. I.e., if a test inserts N keys, then uses a random cursor to sample those keys. How do we make sure that we get the same behavior in a different test run that inserts the same N keys in a table. The resulting tree contents will be the same. But the tree shape is likely to be different as it depends on non-deterministic things like when checkpoints run and when pages split. Possibly we could handle this by further restricting the usage to tables where all of the inserts happened via bulk load. But I don't know if this would be too constraining.

      This request came from a discussion with anton.korshunov@mongodb.com and the EMEA query team.

            Assignee:
            backlog-server-storage-engines [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            keith.smith@mongodb.com Keith Smith
            Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

              Created:
              Updated: