Type: Task
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Cache and Eviction
Labels: None
Assigned Teams: Storage Engines - Transactions
Total Hours with Assigned Team: 1,216.145
Sprint: SE Transactions - 2026-08-14
Story Points: 3

Summary

WT-15538 recommends raising eviction_updates_trigger (with success reported "as high as 95%") as the remediation for the high-update-ratio eviction problem. This ticket evaluates making 95% the default across the sys-perf suite.

Verdict (FTDC-verified): the only causal effect of the change on sys-perf is a tail-latency regression on insert-bound workloads (bulk_insert). Every other workload is unaffected - the update-eviction path never engages because updates-in-cache stays far below the trigger. sys-perf does not reproduce the WT-15538 regime, so it cannot demonstrate the benefit; it only surfaces the downside. 95% should not ship as a global default on the strength of sys-perf.

Change tested

The eviction_updates_trigger default was changed from 0 (auto: half of eviction_dirty_trigger, i.e. ~10% of cache) to 95 (95% of cache).
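
To make the two settings concrete, the following is a minimal sketch of how the trigger could be resolved to a byte threshold. The helper name is hypothetical and is not WiredTiger code. It assumes the WiredTiger convention that values from 1 to 100 are a percentage of cache size and larger values are an absolute size, the 20% eviction_dirty_trigger mentioned later in this ticket, and the 19.33 GB cache from the FTDC section.

```python
GB = 1024 ** 3


def effective_updates_trigger_bytes(cache_bytes, updates_trigger=0, dirty_trigger_pct=20):
    """Hypothetical helper: resolve eviction_updates_trigger to a byte threshold."""
    if updates_trigger == 0:
        # Auto mode: half of eviction_dirty_trigger (assumed to be a percentage here).
        return cache_bytes * (dirty_trigger_pct / 2) / 100
    if updates_trigger <= 100:
        # Values up to 100 are read as a percentage of the cache size.
        return cache_bytes * updates_trigger / 100
    # Larger values are read as an absolute size.
    return float(updates_trigger)


cache = 19.33 * GB
old = effective_updates_trigger_bytes(cache)        # default 0 (auto)
new = effective_updates_trigger_bytes(cache, 95)    # proposed default
print(f"old trigger: {old / GB:.2f} GB, new trigger: {new / GB:.2f} GB")
```

Under these assumptions the old and new thresholds come out near the ~1.93 GB and ~18.4 GB figures quoted in the FTDC section.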

Patch: sys-perf patch
Replicated comparison (3-clone managed multipatch, 4 patch runs/metric): Performance Analyzer

Result: one causal effect

Workload	Metric	Change	Z	Causal?
bulk_insert_w1	95th (InsertMany / Aggregated)	+13.0%	2.18	yes - real regression

No other workload showed a change attributable to this knob (see FTDC below). The replicated comparison did surface several read/mixed rows at |z| > 2 (linkbench2, mixed_workloads, find_one_and_update, ycsb in_cache), but FTDC proves the trigger never engages in those workloads, so they are not effects of this change and are excluded as results.

FTDC verification (cache = 19.33 GB; old trigger ~1.93 GB / 10%, new ~18.4 GB / 95%)

The trigger only acts once updates-in-cache crosses the threshold. Per-workload:

Workload (phase)	updates-in-cache (max)	% of cache	Crosses old 10% trigger?
bulk_insert_w1 (load)	6.19 GB	32%	yes (~3x over)
bulk_insert_w1 (load_with_indexes)	3.78 GB	20%	yes
linkbench2 (request_test)	0.99 GB	5.1%	no
mixed_workloads	0.53 GB	2.8%	no
find_one_and_update_embedded	0.62 GB	3.2%	no
ycsb in_cache 95read5update	0.28 GB	1.4%	no
ycsb out_of_cache 95read5update	0.61 GB	3.2%	no
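
The crossing check in the table can be reproduced with a short script. This is a sketch over the rounded GB values listed above, so a recomputed percentage may differ from the table by a tenth of a point; the variable and function names are illustrative only.

```python
CACHE_GB = 19.33
OLD_TRIGGER_PCT = 10   # auto default: half of the 20% dirty trigger
NEW_TRIGGER_PCT = 95   # value under test

# Peak updates-in-cache per workload (GB), copied from the FTDC table.
peaks_gb = {
    "bulk_insert_w1 (load)": 6.19,
    "bulk_insert_w1 (load_with_indexes)": 3.78,
    "linkbench2 (request_test)": 0.99,
    "mixed_workloads": 0.53,
    "find_one_and_update_embedded": 0.62,
    "ycsb in_cache 95read5update": 0.28,
    "ycsb out_of_cache 95read5update": 0.61,
}


def crosses(peak_gb, trigger_pct, cache_gb=CACHE_GB):
    """True if the peak exceeds trigger_pct percent of the cache."""
    return peak_gb > cache_gb * trigger_pct / 100


for name, peak in peaks_gb.items():
    pct = 100 * peak / CACHE_GB
    print(f"{name:36s} {pct:5.1f}%  "
          f"old={crosses(peak, OLD_TRIGGER_PCT)}  new={crosses(peak, NEW_TRIGGER_PCT)}")

crossing_old = [n for n, p in peaks_gb.items() if crosses(p, OLD_TRIGGER_PCT)]
crossing_new = [n for n, p in peaks_gb.items() if crosses(p, NEW_TRIGGER_PCT)]
```

With these inputs only the two bulk_insert phases exceed the old threshold, and nothing exceeds the new one.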

Only bulk_insert pushes updates-in-cache past the old 10% trigger. There the old default recruited application threads to evict update content early. Raising the trigger to 95% effectively disables that, so update content accumulates to 3-6 GB and is flushed in bursts via dirty/general eviction, producing the +13% 95th-percentile spikes. This is a genuine, mechanism-backed regression.

For every other workload, updates stay at 1.4-5.1% of cache, so the update-eviction path is inactive in both configurations and the change is a no-op. The application-thread eviction seen in linkbench request_test (160k requests) is driven by dirty content reaching the 20% dirty trigger, which this change does not touch.

Why the non-bulk_insert rows are not effects of this change

Those rows come from 4 self-consistent patch runs (low CoV) compared against the historical stable region, not against a fresh baseline (only 2 of 141 rows had a direct base value). Since FTDC shows the change cannot affect these workloads, the few-percent offset from the historical band is attributable to base-commit, infrastructure, or drift effects over the stable window (Apr-May), not to the trigger change.

An earlier single-run comparison also flagged a large ycsb out_of_cache read regression (z=9.49); that was a 3-point warmup-phase artifact and did not survive replication. Same root cause: comparison against a stable region with no direct baseline.

Recommendation

Do not ship 95% as a global default based on sys-perf: the only causal effect observed is the bulk_insert tail regression.
sys-perf is the wrong test bed - no workload reproduces the WT-15538 regime (>= 72 GB cache, sustained 2000+ updates/s, update ratio > dirty ratio). Validate the benefit on a workload that actually drives updates-in-cache past the trigger (or a HELP-ticket repro), where the high-trigger remediation is known to help.
If a default change is still desired, evaluate an intermediate value and measure specifically on update-heavy workloads; weigh any gain against the bulk_insert tail-latency cost.

Method note

Base side of the replicated comparison is the historical stable region (2/141 rows had a fresh direct baseline). Patch-side CoV across the 4 runs is tight (1-7%) and stable-region sample sizes are healthy (n=24-79), so the measurements are precise - but precision against a historical band is not the same as a causal A/B, which is why FTDC was needed to separate real effects from drift.
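
The distinction between precision and causality can be illustrated with a small sketch. All sample values below are made up for illustration and are not sys-perf measurements, and the z formula (patch mean minus stable-region mean, divided by the stable-region standard deviation) is an assumption; Performance Analyzer may compute it differently.

```python
import statistics

# Hypothetical samples, NOT real sys-perf data.
stable_region = [100, 102, 98, 101, 99]   # historical stable-region values
patch_runs = [110, 111, 109, 110]         # 4 replicated patch runs


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


def z_score(patch, stable):
    """Assumed definition: offset of the patch mean in stable-region std devs."""
    return (statistics.mean(patch) - statistics.mean(stable)) / statistics.stdev(stable)


cov = coefficient_of_variation(patch_runs)
z = z_score(patch_runs, stable_region)
print(f"patch CoV = {cov:.2%}, z = {z:.2f}")
```

A tight CoV and a large |z| here only say that the patch runs sit consistently away from the historical band. They do not say why, which is the gap the FTDC check fills.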

is related to

WT-15538 Investigate slow eviction behavior when updates ratio is high

Open

Assignee: [DO NOT USE] Backlog - Storage Engines Team
Reporter: Haribabu Kommi
Votes: 0
Watchers: 2

Created: May 31 2026 05:26:07 AM UTC
Updated: Jul 03 2026 01:04:22 AM UTC
