Add software prefetch pipeline to WiredTiger B-tree search and cursor reopen path

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Storage Engines - Transactions

      Summary:

      The leaf-page binary search in __wt_row_search is severely memory-bound on Graviton4 (IPC 0.69, 61% backend-bound, L1D MPKI 54). Each iteration incurs two serial cache misses, one for the pg_row entry and one for the key data it points to. This stalls the CPU pipeline for hundreds of cycles. This patch introduces a 2-stage software prefetch pipeline that hides both miss latencies, along with complementary prefetches in the cursor reopen path and a TSAN correctness fix.

      Problem:
      On YCSB 100% read (in-cache, Graviton4 96-core, 128 threads, 3-node replica set), __wt_row_search is the hottest WiredTiger function. ARM Total Performance profiling shows:

      • __wt_row_search: IPC 0.692, Backend Bound 61.3%, L1D MPKI 54.0, L1D Demand MPKI 48.4. The binary search loop stalls on two serial cache misses per iteration (load pg_row[mid], then load key data), each costing ~200+ cycles from L3/DRAM.
      • __wt_row_leaf_key: IPC 0.500, 100% backend-bound, SPE average load latency 85.5 cycles, L1C miss slowdown 11.97x. This function is completely dominated by cache misses when decoding key data.
      • __curfile_reopen: CPI 4.471, L1D MPKI 114.6, SPE average load latency 85.8 cycles. The cursor reopen path stalls on a cold dhandle->handle (WT_BTREE) cache line.
      • __wt_session_lock_dhandle: The fast-path trylock reads dhandle->excl_session without a TSAN suppression. The access is race-free by construction, but TSAN flags it on every uncontended lock acquisition, failing CI. (A generic sketch of the pattern follows this list.)
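
      For context, one common way to make such a fast-path read TSAN-clean is a relaxed atomic load. The sketch below is a generic illustration with simplified names (DHANDLE, dhandle_held_by); it is not the actual WiredTiger code or necessarily the suppression mechanism the patch uses:

      {code:c}
#include <stdatomic.h>

/* Generic stand-in for the data handle; only the flagged field is shown. */
typedef struct {
    _Atomic(void *) excl_session; /* owning session when exclusively held, else NULL */
} DHANDLE;

/*
 * Fast-path check: other threads write excl_session under a lock, but the
 * relaxed atomic load tells TSAN this unsynchronized read is intentional,
 * while adding no memory-ordering cost on the fast path.
 */
static inline int
dhandle_held_by(DHANDLE *dh, void *session)
{
    return (atomic_load_explicit(&dh->excl_session, memory_order_relaxed) == session);
}
      {code}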

      The existing single-level key prefetch (1 iteration ahead) provides only ~20-40 cycles of runway, which is insufficient to hide an L3 miss on Graviton4's cache hierarchy (L1d 64KB, L2 2MB, L3 36MB shared). A deeper pipeline both lengthens that runway and overlaps the two dependent misses (row entry, then key data) instead of serializing them.

      Proposed Fix:

      See changes in this patch

      Four changes across four files (row_srch.c, cur_file.c, cur_std.c, session_dhandle.c):

      1. 2-stage prefetch pipeline in leaf binary search (row_srch.c: __wt_row_search):

      • Level 2: Prefetch pg_row struct entries 2 iterations ahead via new __row_search_prefetch_row_entry(). Brings the WT_ROW.__key pointer into L1 so the subsequent key decode doesn't stall.
      • Level 1: Existing __row_search_prefetch_key() prefetches key data 1 iteration ahead.
      • Guarded by #if WT_ROW_SEARCH_PREFETCH_DEPTH >= 2 (tunable at build time, default 2). Only fires when limit > 6 to avoid wasted prefetches on final iterations.
      • Applied to all three leaf binary search variants (short key fast path, collator path, general path). An illustrative sketch of the pipeline follows this list.
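
      The following sketch is illustrative, not the patch itself: ROW and compare() stand in for WT_ROW and the key comparison, and the candidate-slot arithmetic approximates the real loop's bookkeeping.

      {code:c}
#include <stddef.h>

/* Stand-in for WT_ROW: one slot of the on-page row array (pg_row). */
typedef struct {
    const void *key; /* analogous to WT_ROW.__key */
} ROW;

#ifndef ROW_SEARCH_PREFETCH_DEPTH
#define ROW_SEARCH_PREFETCH_DEPTH 2 /* build-time tunable, as in the patch */
#endif

/*
 * Binary search over rows[0..entries) with the 2-stage prefetch pipeline.
 * compare() returns <0/0/>0 for the search key versus the slot's key.
 */
static int
row_search(ROW *rows, size_t entries, int (*compare)(const void *key))
{
    size_t base, limit, mid;
    int cmp;

    for (base = 0, limit = entries; limit != 0; limit >>= 1) {
        mid = base + (limit >> 1);

#if ROW_SEARCH_PREFETCH_DEPTH >= 2
        /*
         * Stage 2: prefetch the ROW structs at the four midpoints the
         * search could visit two iterations from now, so stage 1 can
         * read their key pointers from L1 on the next iteration. Skip
         * when limit <= 6: the search ends before the prefetch pays off.
         */
        if (limit > 6) {
            size_t eighth = limit >> 3;
            __builtin_prefetch(&rows[base + eighth], 0, 3);
            __builtin_prefetch(&rows[base + 3 * eighth], 0, 3);
            __builtin_prefetch(&rows[base + 5 * eighth], 0, 3);
            __builtin_prefetch(&rows[base + 7 * eighth], 0, 3);
        }
#endif
        /*
         * Stage 1: prefetch the key bytes for both midpoints the next
         * iteration could visit. Reading .key here is a demand load of
         * the ROW struct, cheap because stage 2 prefetched that line on
         * the previous iteration.
         */
        if (limit > 2) {
            __builtin_prefetch(rows[base + (limit >> 2)].key, 0, 3);
            __builtin_prefetch(rows[mid + (limit >> 2)].key, 0, 3);
        }

        cmp = compare(rows[mid].key);
        if (cmp == 0)
            return ((int)mid); /* exact match */
        if (cmp > 0) {         /* search key sorts after rows[mid] */
            base = mid + 1;
            --limit;
        }
    }
    return (-1); /* not found; the real code tracks an insert position */
}
      {code}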

      2. Internal page WT_REF prefetch (row_srch.c: __wt_row_search):

      • New __row_search_prefetch_ref() prefetches pindex->index[slot] (WT_REF entries) for the next two binary search candidates during tree descent, applied to all three internal page search variants (short key, collator, general). Fires when limit > 2. A simplified sketch follows.
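
      A simplified sketch, with PAGE_INDEX and REF standing in for WT_PAGE_INDEX and WT_REF (only the fields the example needs are shown):

      {code:c}
/* Stand-ins for WT_REF and WT_PAGE_INDEX. */
typedef struct {
    void *page; /* a REF points at a child page */
} REF;

typedef struct {
    unsigned int entries;
    REF **index; /* array of REF pointers, as in pindex->index[slot] */
} PAGE_INDEX;

/*
 * Prefetch the REF entries at the two slots the internal-page binary
 * search could visit next; called each iteration with the current
 * base/limit, and skipped near the end of the search.
 */
static inline void
row_search_prefetch_ref(PAGE_INDEX *pindex, unsigned int base, unsigned int limit)
{
    if (limit > 2) {
        /* Next midpoint if the search moves into the lower half... */
        __builtin_prefetch(pindex->index[base + (limit >> 2)], 0, 3);
        /* ...and if it moves into the upper half. */
        __builtin_prefetch(pindex->index[base + (limit >> 1) + (limit >> 2)], 0, 3);
    }
}
      {code}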
         

      3. Cursor dhandle->handle prefetch (cur_file.c: __curfile_reopen, cur_std.c: __wt_cursor_cache_get):

      • Adds __builtin_prefetch(dhandle->handle, 0, 3) alongside the existing dhandle->rwlock write-intent prefetch. Warms the WT_BTREE struct into L1 before the cursor needs it. A minimal sketch follows.
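
      A minimal sketch of the reopen-path change, with DATA_HANDLE reduced to the two fields involved (the lock type and helper name are placeholders, not WiredTiger's own):

      {code:c}
#include <pthread.h>

/* Reduced stand-in for WT_DATA_HANDLE. */
typedef struct {
    pthread_rwlock_t rwlock; /* stands in for dhandle->rwlock */
    void *handle;            /* the WT_BTREE this dhandle wraps */
} DATA_HANDLE;

static inline void
cursor_reopen_prefetch(DATA_HANDLE *dhandle)
{
    /* Existing: write-intent prefetch of the lock about to be taken. */
    __builtin_prefetch(&dhandle->rwlock, 1, 3);
    /* New: warm the WT_BTREE so the first dereference doesn't miss. */
    __builtin_prefetch(dhandle->handle, 0, 3);
}
      {code}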
         

      4. TSAN suppression for the dhandle trylock fast path (session_dhandle.c: __wt_session_lock_dhandle):

      • Adds the missing TSAN suppression around the fast-path read of dhandle->excl_session, so the race-free-by-construction access described above no longer fails CI.

       Impact:

      Validated with ARM TP (topdown + SPE) and DSI on-CPU profiling (perf record) on Graviton4:

      Metric                             | Baseline | With Patch | Delta
      __wt_row_search IPC                | 0.692    | 0.971      | +40.3%
      __wt_row_search Backend Bound      | 61.3%    | 50.1%      | -11.2pp
      __wt_row_search L1D MPKI           | 54.0     | 35.3       | -34.7%
      __wt_row_search L1C misses (SPE)   | 9,169    | 3,698      | -59.7%
      __wt_row_search on-CPU self-time   | 4.08%    | 3.26%      | -0.82pp
      __wt_row_leaf_key SPE latency      | 85.5 cy  | 26.8 cy    | -68.7%
      __wt_row_leaf_key Backend Bound    | 100%     | 36.3%      | -63.7pp
      __curfile_reopen SPE latency       | 85.8 cy  | 49.9 cy    | -41.9%
      SPE late prefetches                | 0        | 1,082      | prefetches active (hardware confirmation)
      Net on-CPU self-time saved         |          |            | -0.67pp

      Estimated CPU benefit: ~0.8% CPU savings on the in-cache, 100% read workload.

      See multipatch results here. 

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Jawwad Asghar
            Votes:
            0
            Watchers:
            1
