- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Storage Engines - Transactions
- 439.812
- SE Transactions - 2026-06-19
- 1
Summary
In __rec_append_orig_value (src/reconcile/rec_visibility.c), the on-page value is routed through a session scratch buffer via __wt_page_cell_data_ref_kv before being handed to __wt_upd_alloc. For non-overflow cells (unpack->type == WT_CELL_VALUE), __cell_data_ref does nothing beyond store->data = unpack->data; store->size = unpack->size; and __wt_upd_alloc memcpy's the value into the new WT_UPDATE immediately. The scratch buffer is wasted work in that path.
Proposed change
Pass a stack-local WT_ITEM populated from unpack->data / unpack->size directly to __wt_upd_alloc when unpack->type == WT_CELL_VALUE. Keep the scratch-buffer path for overflow cells, where __wt_ovfl_read needs a real buffer.
/* Before */
WT_ERR(__wt_scr_alloc(session, 0, &tmp));
WT_ERR(__wt_page_cell_data_ref_kv(session, page, unpack, tmp));
WT_ERR(__wt_upd_alloc(session, tmp, WT_UPDATE_STANDARD, &append, &size));

/* After */
WT_ITEM cell_ref;
WT_ITEM *src;

if (unpack->type == WT_CELL_VALUE) {
    cell_ref.data = unpack->data;
    cell_ref.size = unpack->size;
    src = &cell_ref;
} else {
    WT_ERR(__wt_scr_alloc(session, 0, &tmp));
    WT_ERR(__wt_page_cell_data_ref_kv(session, page, unpack, tmp));
    src = tmp;
}
WT_ERR(__wt_upd_alloc(session, src, WT_UPDATE_STANDARD, &append, &size));
WT_CELL_VALUE_COPY cells decode to unpack->type == WT_CELL_VALUE after the copy-cell restart in __wt_cell_unpack_kv (only unpack->raw retains the COPY tag), so they take the fast path correctly. WT_CELL_VALUE_OVFL keeps unpack->type == WT_CELL_VALUE_OVFL and goes through the scratch path. WT_CELL_VALUE_OVFL_RM is excluded by the existing assert.
Motivation
Flamegraph from BF-41977 profile patch 6a15b28e (tpcc_majority_out_of_cache, clean mainline, WT-17490 + WT-17598 already merged):
- __rec_append_orig_value self-time: 1.19% DSC vs 0.06% ASC (+1.13 pp)
- Called more in DSC because the 30-min snapshot history window causes more upd_select->upd_saved == true triggers per reconciliation (see BF-41977 flamegraph attachment bf41977_dsc_workload_flamegraph.svg.gz)
Why this is low priority
Single-run sys-perf result on DSC tpcc_majority_out_of_cache (patch 6a166576, comparison 6a16707e): patch tpmC = 39,942.5 vs 7-day stable mean 43,528 (CoV 3.93%). The result is inconclusive: the workload's run-to-run noise is larger than the predicted gain. Demonstrating a roughly 1% effect against a roughly 4% CoV needs a 3-5 clone multipatch.
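The arithmetic behind "inconclusive" can be sketched in a few lines of C. This is a rough sanity check, not part of the patch: it assumes the quoted 3.93% CoV describes independent, roughly normal run-to-run noise, and it uses the textbook relation that the standard error of the mean of n runs is CoV / sqrt(n). The helper names percent_delta and runs_for_target_sem are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Percent difference of an observed value from a reference mean. */
static double
percent_delta(double observed, double mean)
{
    assert(mean != 0.0);
    return (100.0 * (observed - mean) / mean);
}

/*
 * Smallest run count n for which cov_pct / sqrt(n) <= target_sem_pct,
 * which is equivalent to n >= (cov_pct / target_sem_pct)^2.
 * Assumes independent runs whose noise matches the quoted CoV.
 */
static int
runs_for_target_sem(double cov_pct, double target_sem_pct)
{
    double need, ratio;
    int n;

    assert(cov_pct >= 0.0 && target_sem_pct > 0.0);
    ratio = cov_pct / target_sem_pct;
    need = ratio * ratio;
    n = (int)need; /* Truncate toward zero. */
    if ((double)n < need)
        ++n; /* Round up without depending on libm's ceil(). */
    return (n < 1 ? 1 : n);
}
```

With the figures above, percent_delta(39942.5, 43528.0) is about -8.2%, roughly two CoVs below the stable mean, so a single run says little about a 1% change in either direction. Under the same assumptions, runs_for_target_sem(3.93, 1.0) gives 16 runs to bring the standard error of one arm down to 1%, while 3 to 5 clones leave it at roughly 2.3% to 1.8%. That suggests the clone count may need to be revisited if the goal is to resolve the effect rather than just rule out a regression.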
The change itself is small (20 lines), safe, and based on a real flamegraph hotspot. It's a reasonable bundle candidate if other reconciliation micro-optimizations are worked on at the same time.
Verification
- __cell_data_ref only copies data/size fields for WT_CELL_VALUE (src/include/cell_inline.h line 1855)
- __wt_upd_alloc memcpy's the value into the new WT_UPDATE (src/include/txn_inline.h line 1476), so the stack-local WT_ITEM doesn't need to outlive the call
- Builds cleanly (verified locally)
- Sys-perf run on DSC tpcc_majority_out_of_cache completed without errors (just no measurable improvement at n=1)
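The lifetime argument in the second bullet can be illustrated with a self-contained sketch. It does not use WiredTiger's types or API: item_view and update_copy are hypothetical stand-ins for WT_ITEM and WT_UPDATE, and update_alloc only mimics the copy-on-allocate behavior described above. The point is that once the allocator has copied the bytes, the stack-local descriptor and even the source bytes can change without affecting the new update.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for WT_ITEM: a borrowed (pointer, length) view. */
typedef struct {
    const void *data;
    size_t size;
} item_view;

/* Hypothetical stand-in for WT_UPDATE: owns a private copy of the value. */
typedef struct {
    size_t size;
    unsigned char data[]; /* Flexible array member holding the copied bytes. */
} update_copy;

/* Allocate an update and copy the value bytes into it before returning. */
static update_copy *
update_alloc(const item_view *value)
{
    update_copy *upd;

    assert(value != NULL);
    upd = malloc(sizeof(*upd) + value->size);
    if (upd == NULL)
        return (NULL);
    upd->size = value->size;
    if (value->size != 0)
        memcpy(upd->data, value->data, value->size);
    return (upd);
}

/* Build an update from on-page bytes using only a stack-local view. */
static update_copy *
append_from_page(const unsigned char *page_data, size_t page_size)
{
    item_view cell_ref; /* Lives only for the duration of this call. */

    cell_ref.data = page_data;
    cell_ref.size = page_size;
    return (update_alloc(&cell_ref));
}
```

A caller would pass the on-page bytes to append_from_page and later free() the result. Because the copy happens inside update_alloc, no scratch buffer is needed to keep the descriptor alive after the call returns.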