Type: Task
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: None
Storage Engines - Transactions
Summary:
The leaf-page binary search in __wt_row_search is severely memory-bound on Graviton4 (IPC 0.69, 61% backend-bound, L1D MPKI 54). Each iteration incurs two serial cache misses, one for the pg_row entry and one for the key data it points to. This stalls the CPU pipeline for hundreds of cycles. This patch introduces a 2-stage software prefetch pipeline that hides both miss latencies, along with complementary prefetches in the cursor reopen path and a TSAN correctness fix.
Problem:
On YCSB 100% read (in-cache, Graviton4 96-core, 128 threads, 3-node replica set), __wt_row_search is the hottest WiredTiger function. ARM Total Performance profiling shows:
- __wt_row_search: IPC 0.692, Backend Bound 61.3%, L1D MPKI 54.0, L1D Demand MPKI 48.4. The binary search loop stalls on two serial cache misses per iteration (load pg_row[mid], then load key data), each costing ~200+ cycles from L3/DRAM.
- __wt_row_leaf_key: IPC 0.500, 100% backend-bound, SPE average load latency 85.5 cycles, L1C miss slowdown 11.97x. This function is completely dominated by cache misses when decoding key data.
- __curfile_reopen: CPI 4.471, L1D MPKI 114.6, SPE average load latency 85.8 cycles. The cursor reopen path stalls on a cold dhandle->handle (WT_BTREE) cache line.
- __wt_session_lock_dhandle: The fast-path trylock reads dhandle->excl_session without TSAN suppression, triggering CI failures on every uncontended lock acquisition despite being race-free by construction.
The existing single-level key prefetch (1 iteration ahead) provides only ~20-40 cycles of runway. This is insufficient to hide an L3 miss on Graviton4's cache hierarchy (L1d 64KB, L2 2MB, L3 36MB shared).
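To make the dependency chain concrete, below is a minimal stand-alone sketch of the leaf binary search described above. The `row_slot` mock and function name are illustrative stand-ins, not WiredTiger's actual `WT_ROW`/`__wt_row_search` definitions; the point is that each iteration serializes two dependent loads, so a miss on either stalls the pipeline.

```c
#include <assert.h>
#include <string.h>

/* Illustrative mock of the leaf-page layout: an array of slots, each
 * pointing at out-of-line key bytes (a stand-in for WT_ROW/pg_row). */
struct row_slot {
    const char *key;
};

/*
 * Baseline search: every iteration issues two dependent loads, the slot
 * pg_row[indx] (miss #1) and then the key bytes it points at (miss #2).
 * Neither load can begin until the previous compare resolves, so an
 * L3/DRAM miss stalls the whole pipeline for hundreds of cycles.
 */
static int
row_search_plain(const struct row_slot *pg_row, int entries, const char *srch_key)
{
    int base, indx, limit, cmp;

    for (base = 0, limit = entries; limit != 0; limit >>= 1) {
        indx = base + (limit >> 1);
        cmp = strcmp(srch_key, pg_row[indx].key); /* load #1, then load #2 */
        if (cmp == 0)
            return (indx);
        if (cmp > 0) {
            base = indx + 1;
            --limit;
        }
    }
    return (-1); /* not found */
}
```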
Proposed Fix:
See changes in this patch.
Four changes across 4 files (row_srch.c, cur_file.c, cur_std.c, session_dhandle.c):
1. 2-stage prefetch pipeline in leaf binary search (row_srch.c: __wt_row_search):
- Level 2: Prefetch pg_row struct entries 2 iterations ahead via the new __row_search_prefetch_row_entry(). Brings the WT_ROW.__key pointer into L1 so the subsequent key decode doesn't stall.
- Level 1: Existing __row_search_prefetch_key() prefetches key data 1 iteration ahead.
- Guarded by #if WT_ROW_SEARCH_PREFETCH_DEPTH >= 2 (tunable at build time, default 2). Only fires when limit > 6 to avoid wasted prefetches on final iterations.
- Applied to all three leaf binary search variants (short key fast path, collator path, general path).
2. Internal page WT_REF prefetch (row_srch.c: __wt_row_search):
- New __row_search_prefetch_ref() prefetches pindex->index[slot] (WT_REF entries) for the next two binary search candidates during tree descent, applied to all three internal page search variants (short key, collator, general). Fires when limit > 2.
3. Cursor dhandle->handle prefetch (cur_file.c: __curfile_reopen, cur_std.c: __wt_cursor_cache_get):
- Adds __builtin_prefetch(dhandle->handle, 0, 3) alongside the existing dhandle->rwlock write-intent prefetch, warming the WT_BTREE struct into L1 before the cursor dereferences it.
4. TSAN fix for the dhandle trylock fast path (session_dhandle.c: __wt_session_lock_dhandle):
- Adds the missing TSAN suppression to the race-free fast-path read of dhandle->excl_session, eliminating the false-positive reports that fail CI on every uncontended lock acquisition.
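The shape of change 1 can be sketched as follows. The mock slot layout, function name, and candidate-offset arithmetic are illustrative, not the patch's actual code; what the sketch preserves from the description above is the structure: a stage-2 slot prefetch guarded by `WT_ROW_SEARCH_PREFETCH_DEPTH >= 2` and `limit > 6`, and a stage-1 key prefetch one iteration ahead.

```c
#include <assert.h>
#include <string.h>

/* Illustrative mock of the leaf-page layout (stand-in for WT_ROW). */
struct row_slot {
    const char *key;
};

/* Prefetch depth, tunable at build time as in the patch; >= 2 enables
 * the second (slot-struct) stage. */
#ifndef WT_ROW_SEARCH_PREFETCH_DEPTH
#define WT_ROW_SEARCH_PREFETCH_DEPTH 2
#endif

static int
row_search_pipelined(const struct row_slot *pg_row, int entries, const char *srch_key)
{
    int base, indx, limit, cmp, next_half;

    for (base = 0, limit = entries; limit != 0; limit >>= 1) {
        indx = base + (limit >> 1);
        next_half = limit >> 2; /* approximate offset of the next candidates */
#if WT_ROW_SEARCH_PREFETCH_DEPTH >= 2
        /*
         * Stage 2: slot structs for both the go-left and go-right
         * candidates, roughly two iterations ahead. Skipped near the
         * end of the search (limit <= 6) where it would be wasted.
         */
        if (limit > 6) {
            __builtin_prefetch(&pg_row[indx - next_half], 0, 3);
            __builtin_prefetch(&pg_row[indx + next_half], 0, 3);
        }
#endif
        /*
         * Stage 1: key bytes of the next iteration's candidates. The
         * slot pointers dereferenced here were themselves prefetched by
         * stage 2 on an earlier iteration, so these loads are cheap.
         */
        if (limit > 2) {
            __builtin_prefetch(pg_row[indx - next_half].key, 0, 3);
            __builtin_prefetch(pg_row[indx + next_half].key, 0, 3);
        }
        cmp = strcmp(srch_key, pg_row[indx].key);
        if (cmp == 0)
            return (indx);
        if (cmp > 0) {
            base = indx + 1;
            --limit;
        }
    }
    return (-1);
}
```

The prefetches are pure hints, so the search result is identical to the baseline; only the miss latency visible to the compare changes. The same two-ahead candidate pattern applies to the internal-page WT_REF prefetch of change 2, minus the second dereference level.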
Impact:
Validated with ARM TP (topdown + SPE) and DSI on-CPU profiling (perf record) on Graviton4:
| Metric | Baseline | With Patch | Delta |
|---|---|---|---|
| __wt_row_search IPC | 0.692 | 0.971 | +40.3% |
| __wt_row_search Backend Bound | 61.3% | 50.1% | -11.2pp |
| __wt_row_search L1D MPKI | 54.0 | 35.3 | -34.7% |
| __wt_row_search L1C misses (SPE) | 9,169 | 3,698 | -59.7% |
| __wt_row_search on-CPU self-time | 4.08% | 3.26% | -0.82pp |
| __wt_row_leaf_key SPE latency | 85.5 cy | 26.8 cy | -68.7% |
| __wt_row_leaf_key Backend Bound | 100% | 36.3% | -63.7pp |
| __curfile_reopen SPE latency | 85.8 cy | 49.9 cy | -41.9% |
| SPE late-prefetch events (hardware confirmation prefetches are issuing) | 0 | 1,082 | Prefetches active |
| Net on-CPU self-time saved | — | — | -0.67pp |
- Estimated CPU benefit: ~0.8% savings on the in-cache 100% read workload.