- Type: Task
- Resolution: Done
- Priority: Minor - P4
- Affects Version/s: None
- Component/s: Cache and Eviction
- Environment: GCC 13, Ubuntu 24.04
- Storage Engines - Foundations
- 906.293
- SE Foundations - 2026-06-09
- 1
Hi,
I've been looking into the normalized position (npos) subsystem in WiredTiger, specifically __wt_page_npos in bt_npos.c and how eviction uses the returned value in evict_walk.c to save and restore its walk position. While writing a focused repro test, I noticed something I'd like a read on before drawing any conclusions, since I'm not fully familiar with the design intent here.
What I was looking at:
__wt_page_npos returns a normalized floating-point position via WT_CLAMP(npos, 0.0, 1.0) (bt_npos.c, line 168). The constant WT_NPOS_RIGHT is defined as (1. + 1e-8) with the comment "Rightmost position in the current page or next page" (btree.h, line 64). My question coming in was: does that above-1.0 value meaningfully influence the computation, or does the final clamp collapse it to exactly 1.0 and lose that "next page" intent at the call site?
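To make the question concrete, here is a minimal standalone sketch of the clamp on its own. It does not use the WiredTiger headers: clamp_unit is a hypothetical stand-in for WT_CLAMP(npos, 0.0, 1.0), NPOS_RIGHT copies the value quoted above, and NPOS_MID = 0.5 is my assumption about WT_NPOS_MID.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the WiredTiger definitions; not copied from the source tree. */
#define NPOS_RIGHT (1. + 1e-8) /* value quoted from btree.h */
#define NPOS_MID 0.5           /* assumed value of WT_NPOS_MID */

/* Hypothetical equivalent of WT_CLAMP(x, 0.0, 1.0). */
static double
clamp_unit(double x)
{
    return (x < 0.0 ? 0.0 : (x > 1.0 ? 1.0 : x));
}

/* True when a value above 1.0 becomes indistinguishable from exactly 1.0. */
static bool
clamp_loses_overshoot(double x)
{
    assert(x == x); /* reject NaN */
    return (x > 1.0 && clamp_unit(x) == 1.0);
}
```

In this sketch any value above 1.0, however small the excess, comes back as exactly 1.0, so a caller that only sees the clamped result cannot tell "rightmost position" from "next page".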
What I added to the existing csuite test:
The upstream test/csuite/normalized_pos/main.c only exercises a uniform tree (100k keys, 1 key per page, memory_page_max=leaf_page_max=allocation_size=1KB). My repro extends it with a skewed-tree fixture and two additional test functions:
- create_skewed_btree: in-memory table with memory_page_max=512, leaf_page_max=2KB, internal_page_max=2KB, allocation_size=512, split_deepen_min_child=1, 20k ascending inserts
- test_npos_right_clamp: checks whether WT_NPOS_RIGHT and 1 + 1e-5 both collapse to 1.0 after the clamp while the result for WT_NPOS_MID stays below 1.0
- test_npos_per_key: per-key monotonicity and round-trip check extracted as a standalone function (the upstream version embeds this inside test_normalized_pos)
The test code and output are attached (edited_main.c, REPRO.txt).
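For readers without the attachment, the monotonicity half of the per-key check amounts to something like the sketch below. This is not the code in edited_main.c: npos_non_decreasing is a hypothetical helper, and it assumes the per-key positions have already been collected into an array in key order. It accepts equal neighbours, on the assumption that keys sharing a page may map to the same position.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true when positions sampled in ascending key order never decrease.
 * Equal neighbours are allowed (assumed to be keys on the same page).
 */
static bool
npos_non_decreasing(const double *npos, size_t n)
{
    assert(npos != NULL || n == 0);
    for (size_t i = 1; i < n; ++i)
        if (npos[i] < npos[i - 1])
            return (false);
    return (true);
}
```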
What I observed:
In the skewed tree run, 3 samples were recorded where the WT_NPOS_MID result was below 1.0 but both WT_NPOS_RIGHT and 1 + 1e-5 clamped to exactly 1.0. The relevant call site in evict_walk.c (line 93) sets pos = WT_NPOS_RIGHT for a forward-direction internal page and immediately passes it as the start argument to __wt_page_npos, whose return value is always clamped to [0, 1]. So the "next page" intent encoded in the above-1.0 value is not preserved in the stored btree->evict_pos.
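My mental model of why this only shows up on some pages is sketched below. It is a simplified model, not the code in bt_npos.c: I am assuming the position is folded upward one level at a time as (slot + npos) / entries and clamped once at the end. The struct level type and model_page_npos are hypothetical names for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one step from a child up to its parent. */
struct level {
    unsigned slot;    /* index of the child within the parent */
    unsigned entries; /* number of children in the parent */
};

/* Simplified model: fold the position upward, then clamp to [0, 1]. */
static double
model_page_npos(const struct level *path, size_t depth, double start)
{
    double npos = start;

    for (size_t i = 0; i < depth; ++i) {
        assert(path[i].entries > 0 && path[i].slot < path[i].entries);
        npos = ((double)path[i].slot + npos) / (double)path[i].entries;
    }
    return (npos < 0.0 ? 0.0 : (npos > 1.0 ? 1.0 : npos));
}
```

Under this model, a start value above 1.0 stays above 1.0 only along a path that is rightmost at every level, which is where the final clamp turns it into exactly 1.0. On any other path the division brings it back below 1.0 and the clamp has no effect.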
I'm not sure whether callers are expected to handle this case before the clamp, or whether the above-1.0 value is only meaningful during the internal slot computation and the final clamp is intentional. Happy to be corrected here.
For context on why skewed trees are relevant to this: the asymmetry in internal page slot counts is not an artifact of the synthetic config. It comes from __split_internal in bt_split.c. Lines 980-981 compute chunk = pindex->entries / children and remain = pindex->entries - chunk * (children - 1), and lines 987-993 describe the resulting right-split: the original page retains the uniform chunk on the left while the partially-filled remainder moves to a new right-side child. With ascending inserts (the pattern produced by MongoDB's default _id (ObjectId) workload), this right-side child is perpetually underfull relative to settled left-side pages. Setting split_deepen_min_child=1 makes this trigger deterministically at small scale; the default value of 10,000 shifts the threshold but does not eliminate the structural asymmetry.
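The two expressions quoted from bt_split.c can be tried in isolation with the sketch below. Only the arithmetic is taken from the quoted lines; struct split_sizes and compute_split_sizes are hypothetical names, and nothing else about __split_internal is modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two quantities quoted from bt_split.c. */
struct split_sizes {
    uint32_t chunk;  /* entries given to each of the first children - 1 pages */
    uint32_t remain; /* entries left over for the last page */
};

/*
 * Reproduce only the quoted arithmetic:
 *   chunk  = entries / children
 *   remain = entries - chunk * (children - 1)
 * which is the same as remain = chunk + entries % children.
 */
static struct split_sizes
compute_split_sizes(uint32_t entries, uint32_t children)
{
    struct split_sizes s;

    assert(children > 0);
    s.chunk = entries / children;
    s.remain = entries - s.chunk * (children - 1);
    return (s);
}
```

For example, 103 entries split across 10 children gives chunk = 10 and remain = 13, so the last page's slot count differs from its siblings whenever entries is not a multiple of children.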
What I'm not claiming:
Monotonicity of npos(key) was not broken: the per-key round-trip test passed for all 20,000 keys with both __wt_page_from_npos_for_read and __wt_page_from_npos_for_eviction, and the uniform baseline is clean. The question is narrower: does losing the "next page" intent of WT_NPOS_RIGHT at the clamp matter for correctness at the eviction call site, or is the current behavior by design?
Possible direction (open to feedback):
One thought I had: if eviction stored a more direct page reference (or a fallback cursor position) alongside the clamped npos value, the walk could recover from cases where the npos step doesn't advance to a new page. But I'm flagging this as a discussion point rather than a proposed fix, since I don't know whether there are memory or locking constraints that make that approach impractical.
I'd really appreciate any context on whether this is expected behavior, a known limitation, or something worth digging into further.