npos: WT_NPOS_RIGHT collapses to 1.0 after WT_CLAMP in __wt_page_npos, losing next-page intent at eviction call site


    • Type: Task
    • Resolution: Done
    • Priority: Minor - P4
    • None
    • Affects Version/s: None
    • Component/s: Cache and Eviction
    • None
    • Environment:
      GCC 13, Ubuntu 24.04
    • Storage Engines - Foundations
    • 907.375
    • SE Foundations - 2026-06-09
    • 1

      Hi,

      I've been looking into the normalized position (npos) subsystem in WiredTiger , specifically __wt_page_npos in bt_npos.c and how eviction uses the returned value in evict_walk.c to save and restore its walk position. While writing a focused repro test, I ran into an observation I'd love to get a read on before drawing any conclusions, since I'm not fully familiar with all the design intent here.

       

      What I was looking at:
      __wt_page_npos returns a normalized floating-point position via WT_CLAMP(npos, 0.0, 1.0) (bt_npos.c, line 168). The constant WT_NPOS_RIGHT is defined as (1. + 1e-8) with the comment "Rightmost position in the current page or next page" (btree.h, line 64). My question coming in was: does that above-1.0 value meaningfully influence the computation, or does the final clamp collapse it to exactly 1.0 and lose that "next page" intent at the call site?
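      To make the question concrete, here is a minimal sketch of the final clamp step only. NPOS_MID, NPOS_RIGHT, CLAMP, clamp_npos and check_right_collapses are hypothetical stand-ins written from the definitions quoted above (NPOS_MID is assumed to be 0.5); they are not copied from the WiredTiger headers, and the sketch does not model the slot computation that precedes the clamp.

```c
#include <assert.h>

/* Hypothetical stand-ins based on the definitions quoted in this ticket. */
#define NPOS_MID 0.5 /* assumed value, for illustration only */
#define NPOS_RIGHT (1. + 1e-8)
#define CLAMP(x, lo, hi) ((x) < (lo) ? (lo) : ((x) > (hi) ? (hi) : (x)))

/* Model only the last step of the computation: clamp the result to [0, 1]. */
static double
clamp_npos(double npos)
{
    return (CLAMP(npos, 0.0, 1.0));
}

/* Any value above 1.0 is indistinguishable from exactly 1.0 after the clamp. */
static void
check_right_collapses(void)
{
    assert(NPOS_RIGHT > 1.0);              /* representable above 1.0 in a double */
    assert(clamp_npos(NPOS_RIGHT) == 1.0); /* the 1e-8 offset is lost */
    assert(clamp_npos(1.0 + 1e-5) == 1.0); /* a larger offset is lost too */
    assert(clamp_npos(NPOS_MID) < 1.0);    /* in-range values pass through */
}
```

      If the above-1.0 value reaches the clamp unchanged, a caller that only sees the return value cannot tell "rightmost position in this page" apart from "next page".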

       

      What I added to the existing csuite test:
      The upstream test/csuite/normalized_pos/main.c only exercises a uniform tree (100k keys, 1 key per page, memory_page_max=leaf_page_max=allocation_size=1KB). My repro extends it with a skewed-tree fixture and two additional test functions:

      • create_skewed_btree: in-memory table with memory_page_max=512, leaf_page_max=2KB, internal_page_max=2KB, allocation_size=512, split_deepen_min_child=1, 20k ascending inserts
      • test_npos_right_clamp: checks whether WT_NPOS_RIGHT and 1+1e-5 both collapse to 1.0 after the clamp while WT_NPOS_MID is still below 1.0
      • test_npos_per_key: per-key monotonicity and round-trip check extracted as a standalone function (the upstream version embeds this inside test_normalized_pos)

      The test code and output are attached (edited_main.c, REPRO.txt).

       

      My Observation:
      In the skewed tree run, 3 samples were recorded where mid < 1.0 but both WT_NPOS_RIGHT and 1 + 1e-5 clamped to exactly 1.0. The relevant call site in evict_walk.c (line 93) sets pos = WT_NPOS_RIGHT for a forward-direction internal page and immediately passes it as the start argument to __wt_page_npos, whose return value is always clamped to [0, 1]. So the "next page" intent encoded in the above-1.0 value is not preserved in the stored btree->evict_pos.

      I'm not sure whether callers are expected to handle this case before the clamp, or whether the above-1.0 value is only meaningful during the internal slot computation and the final clamp is intentional. Happy to be corrected here.

      For context on why skewed trees are relevant here: the asymmetry in internal page slot counts is not an artifact of the synthetic config. It comes from __split_internal in bt_split.c. Lines 980-981 compute chunk = pindex->entries / children and remain = pindex->entries - chunk * (children - 1), and lines 987-993 describe the resulting right-split: the original page retains the uniform chunk on the left, while the partially-filled remainder moves to a new right-side child. With ascending inserts, as in MongoDB's default _id (ObjectId) workload, this right-side child is perpetually underfull relative to settled left-side pages. Setting split_deepen_min_child=1 makes this trigger deterministically at small scale; the default value of 10,000 shifts the threshold but does not eliminate the structural asymmetry.
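      The two expressions quoted above can be checked in isolation. The helper below, split_sizes, is a hypothetical name that mirrors only that arithmetic; it is not the WiredTiger function and says nothing about how full each page becomes after later inserts.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the two quoted expressions:
 *   chunk  = entries / children
 *   remain = entries - chunk * (children - 1)
 * The caller must pass children >= 1.
 */
static void
split_sizes(uint32_t entries, uint32_t children, uint32_t *chunkp, uint32_t *remainp)
{
    assert(children >= 1);
    *chunkp = entries / children;
    *remainp = entries - *chunkp * (children - 1);
}
```

      For example, with entries=103 and children=10 this gives chunk=10 and remain=13: the first nine pieces hold 10 entries each and the last piece absorbs the division remainder, so the pieces always sum to entries but are uneven whenever entries is not a multiple of children.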

       

      What I'm not claiming:
      Monotonicity of npos(key) was not broken: the per-key round-trip test passed for all 20,000 keys with both __wt_page_from_npos_for_read and __wt_page_from_npos_for_eviction, and the uniform baseline is clean. The question is narrower: whether losing the "next page" intent of WT_NPOS_RIGHT at the clamp matters for correctness at the eviction call site, or whether the current behavior is by design.

       

      Possible direction (open to feedback):
      One thought I had: if eviction stored a more direct page reference (or a fallback cursor position) alongside the clamped npos value, the walk could recover from cases where the npos step doesn't advance to a new page. But I'm flagging this as a discussion point rather than a proposed fix, since I don't know whether there are memory or locking constraints that make that approach impractical.

       

      I'd really appreciate any context on whether this is expected behavior, a known limitation, or something worth digging into further.

        1. REPRO.txt
          1 kB
        2. edited_main.c
          14 kB

            Assignee:
            Yury Ershov
            Reporter:
            Tuna KARABACAK (EXT)
            Votes:
            0
            Watchers:
            3

              Created:
              Updated:
              Resolved: