Validate layered table content during garbage collection

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Layered Tables, Verify
    • Storage Engines

      In diagnostic mode, when garbage collecting records from an ingest table, WT should also verify that a matching record exists in the corresponding shared table, and panic if no match is found.

      On follower nodes, a layered table accumulates changes in its ingest (L1) table. When it picks up a new checkpoint for the shared (L0) table, the layered table garbage collects data that is no longer needed from the ingest table.

      In other words, if the new checkpoint was taken at timestamp T, we expect all content in the ingest table with timestamps <= T to be in that checkpoint, so we can remove those records from the ingest table.

      To validate the correctness of layered tables and of the higher-level query logic (i.e., that the same operations produce the same results on all replicas), it would be useful to verify that the records we prune from the ingest table do, in fact, have matching entries in the current shared checkpoint.

      That is the goal of this ticket.
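
      The check described above can be modelled in a few lines of Python. This is only a sketch for discussion, not WiredTiger code: the names Record, IngestMismatch and prune_ingest are hypothetical, the shared checkpoint is modelled as a plain dict, and "matching" is assumed to mean the same key, timestamp and value, which the ticket does not define.

```python
from dataclasses import dataclass


class IngestMismatch(Exception):
    """Stands in for a WT panic in this toy model (hypothetical name)."""


@dataclass(frozen=True)
class Record:
    key: str
    value: str
    ts: int  # commit timestamp of the update


def prune_ingest(ingest, shared, checkpoint_ts, diagnostic=True):
    """Garbage collect ingest records covered by the shared checkpoint.

    ingest: list of Record held in the ingest table.
    shared: dict mapping (key, ts) -> value, modelling the shared checkpoint.
    Returns the records that must stay in the ingest table.
    """
    kept = []
    for rec in ingest:
        if rec.ts > checkpoint_ts:
            # Newer than the checkpoint: not yet in the shared table, keep it.
            kept.append(rec)
            continue
        # Record is eligible for pruning. In diagnostic mode, confirm first
        # that the shared checkpoint really contains it.
        if diagnostic:
            slot = (rec.key, rec.ts)
            if slot not in shared or shared[slot] != rec.value:
                raise IngestMismatch(
                    f"no match in shared table for key={rec.key!r} ts={rec.ts}"
                )
    return kept
```

      With checkpoint_ts = 10, a record at timestamp 5 is pruned only after its match is found in shared, while a record at timestamp 12 stays in the ingest table. With diagnostic=False the pruning rule is unchanged and the lookup is skipped.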

      A possible embellishment (which could be moved to a separate ticket) would be to perform this check occasionally in release mode. If we, for example, verify 1 in 10,000 records, chosen at random, when garbage collecting ingest tables, it might give us some signal about possible issues. (TBD: What is the right frequency? Should we verify a single value, or an entire chain of values for a key chosen at random? etc.)
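
      One way the release-mode sampling could look, again as a hedged Python sketch rather than a proposal for the actual implementation: should_verify and count_sampled are hypothetical helpers, and the 1-in-10,000 rate is just the example figure from above. Each pruned record is selected independently, so on average n / one_in of n pruned records would be verified.

```python
import random


def should_verify(rng, one_in=10_000):
    """Return True for roughly 1 in `one_in` calls, chosen at random."""
    return rng.randrange(one_in) == 0


def count_sampled(n_records, one_in=10_000, seed=0):
    """Count how many of n_records pruned records would be verified."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return sum(should_verify(rng, one_in) for _ in range(n_records))
```

      Sampling individual records this way is the simplest option. Verifying an entire chain of values for a randomly chosen key would need the decision to be made once per key instead of once per record.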

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Keith Smith