Type: Task
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: Truncate
Storage Engines - Foundations
Background
During step-up, the multi-pass drain algorithm in _layered_drain_ingest_table_and_truncate_list interleaves ingest copies with range truncate replays. The follower records truncates in two places: the truncate list (for replaying against stable at step-up) and the ingest btree itself (tombstones written via _wt_layered_truncate so the follower's own reads see the correct deleted state).
Problem
Tombstones written to the ingest btree by _wt_layered_truncate are structurally identical to tombstones written by explicit user removes (cursor->remove). During drain, _layered_copy_ingest_table cannot distinguish between the two.
This causes a correctness bug when an ingest update at the same timestamp as a truncate should survive the truncate. For example:
- write(key=300, ts=15) → ingest: [update@15]
- truncate([100,700], ts=15) → ingest: [tombstone@15, update@15]; truncate list: T1@ts=15
After drain:
- copy_ingest(NONE, 15) copies [tombstone@15, update@15] to stable
- apply_truncate(ts=15) prepends another tombstone → stable: [tombstone@15(trunc), tombstone@15(ingest), update@15]
The update@15 is permanently buried. Any attempt to fix the drain ordering (e.g. apply truncate first, then copy ingest) runs into the same problem from the other direction. Copying the ingest tombstone after the truncate tombstone produces consecutive same-timestamp tombstones or anomalous tombstone-value-tombstone chains that confuse reconciliation and the history store.
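The sequence above can be reproduced with a small toy model. This is only an illustrative sketch, not WiredTiger code: update chains are plain Python lists (newest first), and the names `Upd`, `follower_write`, `follower_truncate`, `copy_ingest`, `apply_truncate` and the `origin` label are all hypothetical. The `origin` field exists purely so the output is readable; the copy step never looks at it, which mirrors the fact that the two kinds of tombstone are indistinguishable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Upd:
    kind: str    # "update" or "tombstone"
    ts: int
    origin: str  # display-only label; the drain logic below never reads it


def follower_write(ingest, key, ts):
    # Newest entry goes at the head of the chain.
    ingest.setdefault(key, []).insert(0, Upd("update", ts, "write"))


def follower_truncate(ingest, truncate_list, start, stop, ts):
    # The truncate is recorded in two places: the truncate list and,
    # as per-key tombstones, the ingest table itself.
    truncate_list.append((start, stop, ts))
    for key, chain in ingest.items():
        if start <= key <= stop:
            chain.insert(0, Upd("tombstone", ts, "ingest"))


def copy_ingest(ingest, stable, up_to_ts):
    # Copy oldest-first so the chain order is preserved on stable.
    for key, chain in ingest.items():
        for upd in reversed(chain):
            if upd.ts <= up_to_ts:
                stable.setdefault(key, []).insert(0, upd)


def apply_truncate(stable, truncate_list, ts):
    # Replay every truncate recorded at this timestamp against stable.
    for start, stop, t in truncate_list:
        if t != ts:
            continue
        for key, chain in stable.items():
            if start <= key <= stop:
                chain.insert(0, Upd("tombstone", t, "trunc"))


def drain_example():
    ingest, truncate_list, stable = {}, [], {}
    follower_write(ingest, 300, 15)
    follower_truncate(ingest, truncate_list, 100, 700, 15)
    copy_ingest(ingest, stable, 15)
    apply_truncate(stable, truncate_list, 15)
    return stable[300]


if __name__ == "__main__":
    for upd in drain_example():
        print(f"{upd.kind}@{upd.ts} ({upd.origin})")
```

In this model the stable chain for key 300 ends with two same-timestamp tombstones stacked on the update, matching the chain shown in the example.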
Proposed Fix
When _wt_layered_truncate writes tombstones to the ingest btree, mark them with a new flag (e.g. WT_UPDATE_FROM_TRUNCATE, or reuse an existing internal flag). During _layered_copy_ingest_table, skip flagged tombstones: they are subsumed by the corresponding truncate list entry, which will replay the correct range tombstone against stable. Unflagged tombstones (from explicit removes) are copied normally.
This eliminates the ambiguity at the source and makes all drain ordering approaches safe.
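A rough sketch of the flag-and-skip idea, again as a toy Python model rather than the actual implementation. The `from_truncate` field stands in for the proposed flag, and every function name here (`layered_truncate`, `cursor_remove`, `copy_ingest`, `apply_truncates`, `drain_with_flag`) is hypothetical. Key 900 is an invented extra case showing that a tombstone from an explicit remove is still copied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Upd:
    kind: str                    # "update" or "tombstone"
    ts: int
    from_truncate: bool = False  # stand-in for the proposed flag


def layered_truncate(ingest, truncate_list, start, stop, ts):
    # Record the range and write *flagged* tombstones into ingest.
    truncate_list.append((start, stop, ts))
    for key, chain in ingest.items():
        if start <= key <= stop:
            chain.insert(0, Upd("tombstone", ts, from_truncate=True))


def cursor_remove(ingest, key, ts):
    # An explicit remove writes an unflagged tombstone.
    ingest.setdefault(key, []).insert(0, Upd("tombstone", ts))


def copy_ingest(ingest, stable):
    for key, chain in ingest.items():
        for upd in reversed(chain):  # oldest first
            if upd.kind == "tombstone" and upd.from_truncate:
                continue  # left to the truncate list replay
            stable.setdefault(key, []).insert(0, upd)


def apply_truncates(stable, truncate_list):
    for start, stop, ts in truncate_list:
        for key, chain in stable.items():
            if start <= key <= stop:
                chain.insert(0, Upd("tombstone", ts, from_truncate=True))


def drain_with_flag():
    ingest = {300: [Upd("update", 15)], 900: [Upd("update", 10)]}
    truncate_list = []
    layered_truncate(ingest, truncate_list, 100, 700, 15)
    cursor_remove(ingest, 900, 20)  # outside the truncated range
    stable = {}
    copy_ingest(ingest, stable)
    apply_truncates(stable, truncate_list)
    return stable


if __name__ == "__main__":
    for key, chain in drain_with_flag().items():
        print(key, chain)
```

In this model, key 300 ends up with a single tombstone that comes from the truncate replay, while the explicit remove on key 900 is copied unchanged. The sketch only exercises the copy-then-replay order; it does not attempt to model other drain orderings.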
Related
Part of WT-16814 (step-up fast truncate new design).
- is related to WT-16814 Implement fast truncate ingest drain on step up (In Code Review)