-
Type:
Epic
-
Resolution: Unresolved
-
Priority:
Major - P3
-
None
-
Affects Version/s: None
-
Component/s: Block Manager
-
None
-
Storage Engines - Persistence
-
209.776
-
SE Persistence backlog
-
None
-
Implement augmented skiplist
Background
Customers are experiencing compaction slowness. Investigation identified the root cause as {}block_first_srch, which runs in linear time. During compaction, blocks are relocated from the last 10–20% of the file into the beginning, which creates progressively more and smaller holes. As the number of holes grows, }}{_}{{_block_first_srch takes longer and longer because it walks the entire avail offset skiplist linearly to find the first hole that can fit the requested size.
Proposed Fix
Augment the offset skiplist with per-extent range information: each extent node stores the maximum hole size within its span. This allows __block_first_srch to skip entire spans where no hole is large enough, reducing the search from linear to O(log n).
A prototype exists on branch wt-14196-compact-block-first-srch-v2-prototype. It demonstrates correctness but was written as a proof-of-concept and needs to be productionised.
Steps
- Understand the prototype — read wt-14196-compact-block-first-srch-v2-prototype to fully grasp the augmented skiplist design and __block_first_srch_v2.
- Rebase onto develop — the prototype branch is significantly behind develop; merge or rebase to bring it up to date and resolve conflicts.
- Code cleanup — clean the PoC up to production quality:
- Conform to WiredTiger C coding style and conventions
- Remove PoC comments and dead code
- Ensure dist/s_all passes cleanly
- Code improvements — make the following targeted improvements:
- The max_size array is currently fixed at MAX_DEPTH size; change it to a variable-length array (matching the existing next[] pattern) to avoid wasting memory on shallow nodes
- The fake head node is currently heap-allocated (malloc/free); eliminate this allocation to reduce memory usage and CPU overhead
- Revisit naming: decide whether max_size is the right name for the range metadata field
- Any other small improvements identified during cleanup
- Code review and bug fixes — review the full diff for correctness, edge cases, and bugs; fix anything found.
Acceptance Criteria
- __block_first_srch (or its replacement) runs in O(log n) time on the avail offset skiplist
- All existing tests pass
- Code meets WiredTiger style and review standards
Out of Scope
- Removing the size skiplist / replacing best-fit with first-fit everywhere (phase 2)
- Background compaction changes
- Disaggregated storage (compaction is a no-op there)
- related to
-
WT-16682 Assess disk fragmentation in the fleet
-
- Open
-