Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Layered Tables, Verify
Labels:
- ds_durability_high_risk
- ds_durability_mitigation

Epic Link:
Verify stability in disagg
Assigned Teams:

Storage Engines - Persistence
Sprint:
SE Persistence backlog
Story Points:
None

In diagnostic mode, when garbage collecting records from an ingest table, WT should also verify that a matching record exists in the corresponding shared table, and panic if no match is found.

On follower nodes, a layered table accumulates changes in its ingest (L1) table. When it picks up a new checkpoint for the shared (L0) table, the layered table garbage collects data that is no longer needed from the ingest table.

In other words, if the new checkpoint was taken at timestamp T, we expect all content in the ingest table with timestamps <= T to be in that checkpoint, so we can remove those records from the ingest table.

To validate the correctness of layered tables and of the higher-level query logic (i.e., that the same operations produce the same results on all replicas), it would be useful to verify that the records we prune from the ingest table do, in fact, have matching entries in the current shared checkpoint.

That is the goal of this ticket.
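
To make the proposed check concrete, here is a minimal sketch in Python rather than WiredTiger's C. Everything in it is an assumption made for illustration: the tables are modeled as plain dicts mapping (key, timestamp) to value, and `gc_ingest` and `IngestVerifyError` are hypothetical names, with the exception standing in for a panic. It is not how WT represents ingest or shared tables.

```python
_MISSING = object()  # sentinel: record absent from the shared checkpoint


class IngestVerifyError(Exception):
    """Hypothetical stand-in for a WT panic in this sketch."""


def gc_ingest(ingest, shared_checkpoint, checkpoint_ts, diagnostic=True):
    """Prune ingest records with timestamp <= checkpoint_ts.

    Both tables are modeled as dicts mapping (key, timestamp) -> value.
    In diagnostic mode, every record about to be pruned must have an
    identical entry in the shared checkpoint, otherwise we "panic".
    Returns the number of records pruned.
    """
    prunable = [rec for rec in ingest if rec[1] <= checkpoint_ts]

    if diagnostic:
        # Verify everything first so a failure leaves the ingest table intact.
        for rec in prunable:
            if shared_checkpoint.get(rec, _MISSING) != ingest[rec]:
                raise IngestVerifyError(
                    f"no matching shared record for key={rec[0]!r} ts={rec[1]}"
                )

    for rec in prunable:
        del ingest[rec]
    return len(prunable)
```

With a checkpoint at T = 10, a record at timestamp 12 stays in the ingest table, while records at or below 10 are pruned only after the match check passes.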

A possible embellishment (which could be moved to a separate ticket) would be to perform this check occasionally in release mode. If we, for example, verify 1 in 10,000 records, chosen at random, when garbage collecting ingest tables, it might give us some signal about possible issues. (TBD: What is the right frequency? Should we verify a single value, or an entire chain of values for a key chosen at random? etc.)
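
One way the release-mode sampling could look, again as a hedged Python sketch and not a proposal for the actual implementation: each record being garbage collected is selected independently with probability 1 in N. The function name `pick_records_to_verify` and its parameters are hypothetical, and this covers only the "single value" option; verifying a whole chain for a randomly chosen key would need a different selection step.

```python
import random


def pick_records_to_verify(records, one_in=10_000, rng=None):
    """Return the subset of records selected for verification.

    Each record is chosen independently with probability 1/one_in.
    `rng` may be a seeded random.Random for reproducible runs.
    """
    if one_in < 1:
        raise ValueError("one_in must be at least 1")
    if rng is None:
        rng = random.Random()
    # randrange(one_in) is uniform over 0..one_in-1, so it hits 0
    # with probability 1/one_in.
    return [rec for rec in records if rng.randrange(one_in) == 0]
```

Setting `one_in=1` selects every record, which matches the diagnostic-mode behavior described in this ticket.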

Assignee: Sean Watt
Reporter: Keith Smith
Votes: 0
Watchers: 8

Created: Sep 15 2025 03:45:55 PM UTC
Updated: Feb 24 2026 09:14:51 AM UTC
