Add software prefetch pipeline to WiredTiger B-tree search and cursor reopen path

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Storage Engines - Transactions

      Summary:

      The leaf-page binary search in __wt_row_search is severely memory-bound on Graviton4 (IPC 0.69, 61% backend-bound, L1D MPKI 54). Each iteration incurs two serial cache misses, one for the pg_row entry and one for the key data it points to. This stalls the CPU pipeline for hundreds of cycles. This patch introduces a 2-stage software prefetch pipeline that hides both miss latencies, along with complementary prefetches in the cursor reopen path and a TSAN correctness fix.

      Problem:
      On YCSB 100% read (in-cache, Graviton4 96-core, 128 threads, 3-node replica set), __wt_row_search is the hottest WiredTiger function. ARM Total Performance profiling shows:

      • __wt_row_search: IPC 0.692, Backend Bound 61.3%, L1D MPKI 54.0, L1D Demand MPKI 48.4. The binary search loop stalls on two serial cache misses per iteration (load pg_row[mid], then load key data), each costing ~200+ cycles from L3/DRAM.
      • __wt_row_leaf_key: IPC 0.500, 100% backend-bound, SPE average load latency 85.5 cycles, L1C miss slowdown 11.97x. This function is completely dominated by cache misses when decoding key data.
      • __curfile_reopen: CPI 4.471, L1D MPKI 114.6, SPE average load latency 85.8 cycles. The cursor reopen path stalls on a cold dhandle->handle (WT_BTREE) cache line.
      • __wt_session_lock_dhandle: The fast-path trylock reads dhandle->excl_session without a TSAN suppression. The access is race-free by construction, but TSAN flags it on every uncontended lock acquisition, failing CI. (A generic sketch of the pattern follows this list.)
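
      For context, one common way to make such a fast-path read TSAN-clean is a relaxed atomic load. The sketch below is a generic illustration with simplified names (DHANDLE, dhandle_held_by); it is not the actual WiredTiger code or necessarily the suppression mechanism the patch uses:

      {code:c}
#include <stdatomic.h>

/* Generic stand-in for the data handle; only the flagged field is shown. */
typedef struct {
    _Atomic(void *) excl_session; /* owning session when exclusively held, else NULL */
} DHANDLE;

/*
 * Fast-path check: other threads write excl_session under a lock, but the
 * relaxed atomic load tells TSAN this unsynchronized read is intentional,
 * while adding no memory-ordering cost on the fast path.
 */
static inline int
dhandle_held_by(DHANDLE *dh, void *session)
{
    return (atomic_load_explicit(&dh->excl_session, memory_order_relaxed) == session);
}
      {code}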

      The existing single-level key prefetch (1 iteration ahead) provides only ~20-40 cycles of runway, which is insufficient to hide an L3 miss on Graviton4's cache hierarchy (L1d 64KB, L2 2MB, L3 36MB shared). A deeper pipeline both lengthens that runway and overlaps the two dependent misses (row entry, then key data) instead of serializing them.

      Proposed Fix:

      See changes in this patch

      Four changes across four files (row_srch.c, cur_file.c, cur_std.c, session_dhandle.c):

      1. 2-stage prefetch pipeline in leaf binary search (row_srch.c: __wt_row_search):

      • Level 2: Prefetch pg_row struct entries 2 iterations ahead via new __row_search_prefetch_row_entry(). Brings the WT_ROW.__key pointer into L1 so the subsequent key decode doesn't stall.
      • Level 1: Existing __row_search_prefetch_key() prefetches key data 1 iteration ahead.
      • Guarded by #if WT_ROW_SEARCH_PREFETCH_DEPTH >= 2 (tunable at build time, default 2). Only fires when limit > 6 to avoid wasted prefetches on final iterations.
      • Applied to all three leaf binary search variants (short key fast path, collator path, general path). An illustrative sketch of the pipeline follows this list.
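
      The following sketch is illustrative, not the patch itself: ROW and compare() stand in for WT_ROW and the key comparison, and the candidate-slot arithmetic approximates the real loop's bookkeeping.

      {code:c}
#include <stddef.h>

/* Stand-in for WT_ROW: one slot of the on-page row array (pg_row). */
typedef struct {
    const void *key; /* analogous to WT_ROW.__key */
} ROW;

#ifndef ROW_SEARCH_PREFETCH_DEPTH
#define ROW_SEARCH_PREFETCH_DEPTH 2 /* build-time tunable, as in the patch */
#endif

/*
 * Binary search over rows[0..entries) with the 2-stage prefetch pipeline.
 * compare() returns <0/0/>0 for the search key versus the slot's key.
 */
static int
row_search(ROW *rows, size_t entries, int (*compare)(const void *key))
{
    size_t base, limit, mid;
    int cmp;

    for (base = 0, limit = entries; limit != 0; limit >>= 1) {
        mid = base + (limit >> 1);

#if ROW_SEARCH_PREFETCH_DEPTH >= 2
        /*
         * Stage 2: prefetch the ROW structs at the four midpoints the
         * search could visit two iterations from now, so stage 1 can
         * read their key pointers from L1 on the next iteration. Skip
         * when limit <= 6: the search ends before the prefetch pays off.
         */
        if (limit > 6) {
            size_t eighth = limit >> 3;
            __builtin_prefetch(&rows[base + eighth], 0, 3);
            __builtin_prefetch(&rows[base + 3 * eighth], 0, 3);
            __builtin_prefetch(&rows[base + 5 * eighth], 0, 3);
            __builtin_prefetch(&rows[base + 7 * eighth], 0, 3);
        }
#endif
        /*
         * Stage 1: prefetch the key bytes for both midpoints the next
         * iteration could visit. Reading .key here is a demand load of
         * the ROW struct, cheap because stage 2 prefetched that line on
         * the previous iteration.
         */
        if (limit > 2) {
            __builtin_prefetch(rows[base + (limit >> 2)].key, 0, 3);
            __builtin_prefetch(rows[mid + (limit >> 2)].key, 0, 3);
        }

        cmp = compare(rows[mid].key);
        if (cmp == 0)
            return ((int)mid); /* exact match */
        if (cmp > 0) {         /* search key sorts after rows[mid] */
            base = mid + 1;
            --limit;
        }
    }
    return (-1); /* not found; the real code tracks an insert position */
}
      {code}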

      2. Internal page WT_REF prefetch (row_srch.c: __wt_row_search):

      • New __row_search_prefetch_ref() prefetches pindex->index[slot] (WT_REF entries) for the next two binary search candidates during tree descent, applied to all three internal page search variants (short key, collator, general). Fires when limit > 2. A simplified sketch follows.
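
      A simplified sketch, with PAGE_INDEX and REF standing in for WT_PAGE_INDEX and WT_REF (only the fields the example needs are shown):

      {code:c}
/* Stand-ins for WT_REF and WT_PAGE_INDEX. */
typedef struct {
    void *page; /* a REF points at a child page */
} REF;

typedef struct {
    unsigned int entries;
    REF **index; /* array of REF pointers, as in pindex->index[slot] */
} PAGE_INDEX;

/*
 * Prefetch the REF entries at the two slots the internal-page binary
 * search could visit next; called each iteration with the current
 * base/limit, and skipped near the end of the search.
 */
static inline void
row_search_prefetch_ref(PAGE_INDEX *pindex, unsigned int base, unsigned int limit)
{
    if (limit > 2) {
        /* Next midpoint if the search moves into the lower half... */
        __builtin_prefetch(pindex->index[base + (limit >> 2)], 0, 3);
        /* ...and if it moves into the upper half. */
        __builtin_prefetch(pindex->index[base + (limit >> 1) + (limit >> 2)], 0, 3);
    }
}
      {code}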
         

      3. Cursor dhandle->handle prefetch (cur_file.c: __curfile_reopen, cur_std.c: __wt_cursor_cache_get):

      • Adds __builtin_prefetch(dhandle->handle, 0, 3) alongside the existing dhandle->rwlock write-intent prefetch. Warms the WT_BTREE struct into L1 before the cursor needs it. A minimal sketch follows.
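
      A minimal sketch of the reopen-path change, with DATA_HANDLE reduced to the two fields involved (the lock type and helper name are placeholders, not WiredTiger's own):

      {code:c}
#include <pthread.h>

/* Reduced stand-in for WT_DATA_HANDLE. */
typedef struct {
    pthread_rwlock_t rwlock; /* stands in for dhandle->rwlock */
    void *handle;            /* the WT_BTREE this dhandle wraps */
} DATA_HANDLE;

static inline void
cursor_reopen_prefetch(DATA_HANDLE *dhandle)
{
    /* Existing: write-intent prefetch of the lock about to be taken. */
    __builtin_prefetch(&dhandle->rwlock, 1, 3);
    /* New: warm the WT_BTREE so the first dereference doesn't miss. */
    __builtin_prefetch(dhandle->handle, 0, 3);
}
      {code}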
         

      4. TSAN suppression for the dhandle trylock fast path (session_dhandle.c: __wt_session_lock_dhandle):

      • Adds the missing TSAN suppression around the fast-path read of dhandle->excl_session, so the race-free-by-construction access described above no longer fails CI.

       Impact:

      Validated with ARM TP (topdown + SPE) and DSI on-CPU profiling (perf record) on Graviton4:

      Metric                             | Baseline | With Patch | Delta
      __wt_row_search IPC                | 0.692    | 0.971      | +40.3%
      __wt_row_search Backend Bound      | 61.3%    | 50.1%      | -11.2pp
      __wt_row_search L1D MPKI           | 54.0     | 35.3       | -34.7%
      __wt_row_search L1C misses (SPE)   | 9,169    | 3,698      | -59.7%
      __wt_row_search on-CPU self-time   | 4.08%    | 3.26%      | -0.82pp
      __wt_row_leaf_key SPE latency      | 85.5 cy  | 26.8 cy    | -68.7%
      __wt_row_leaf_key Backend Bound    | 100%     | 36.3%      | -63.7pp
      __curfile_reopen SPE latency       | 85.8 cy  | 49.9 cy    | -41.9%
      SPE late prefetches                | 0        | 1,082      | prefetches active (hardware confirmation)
      Net on-CPU self-time saved         |          |            | -0.67pp

      Estimated CPU benefit: ~0.8% CPU savings on the in-cache, 100% read workload.

      See multipatch results here. 

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Jawwad Asghar
            Votes:
            0
            Watchers:
            1
