- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Test wtperf
- None
- Storage Engines, Storage Engines - Transactions
- SE Transactions - 2026-03-13
- 3
Summary
While investigating cache-stuck behavior for disaggregated storage workloads (WT-16817), I found that the apparent “stuck timestamps” were caused by a locking/concurrency bug in Workgen, not in WiredTiger itself. Workgen can hold _ts_mutex while calling session->prepare_transaction or session->commit_transaction. If either call blocks, the timestamp thread cannot acquire _ts_mutex and therefore cannot advance the stable/oldest timestamps.
Problem
- Workgen uses a shared mutex _ts_mutex to protect its timestamp state.
- In the transaction path, Workgen calls session->prepare_transaction and session->commit_transaction while holding _ts_mutex.
- Under load (e.g., cache pressure), these calls can block. If they do, the worker thread remains blocked while still holding _ts_mutex.
- The timestamp thread also needs _ts_mutex to compute and publish new stable/oldest timestamps.
- Once a worker is stuck in prepare/commit with _ts_mutex held, the timestamp thread cannot acquire _ts_mutex, and stable/oldest timestamps stop moving.
- This manifests as an apparent WiredTiger “cache-stuck / timestamp-stuck” scenario, but it is actually an artifact of Workgen's locking.
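The starvation described above can be reproduced in isolation with a minimal sketch. Nothing here is Workgen code: `ts_mutex`, `timestamp_thread_can_lock`, and the simulated blocking call are hypothetical stand-ins, and the "WiredTiger call" is just a wait on a future. The sketch only shows that a worker blocked inside a call with the mutex held prevents any other thread from taking it.

```cpp
#include <future>
#include <mutex>
#include <thread>

// Hypothetical stand-in for Workgen's shared timestamp mutex.
static std::mutex ts_mutex;

// Returns true if a second thread (playing the timestamp thread) can take
// ts_mutex while the worker is blocked inside a simulated WiredTiger call.
bool timestamp_thread_can_lock(bool hold_lock_across_call) {
    std::promise<void> entered;  // worker signals it is inside the "WT call"
    std::promise<void> release;  // caller signals the "WT call" may return
    std::future<void> entered_f = entered.get_future();
    std::future<void> release_f = release.get_future();

    // Simulates prepare_transaction/commit_transaction blocking under load.
    auto blocking_wt_call = [&] {
        entered.set_value();
        release_f.wait();
    };

    std::thread worker([&] {
        if (hold_lock_across_call) {
            // Broad critical section: the lock is held across the blocking call.
            std::lock_guard<std::mutex> guard(ts_mutex);
            blocking_wt_call();
        } else {
            {
                // Narrow critical section: only the timestamp computation.
                std::lock_guard<std::mutex> guard(ts_mutex);
            }
            blocking_wt_call();
        }
    });

    entered_f.wait();  // the worker is now blocked in the simulated call

    // try_lock may fail spuriously, so retry a few times before giving up.
    bool acquired = false;
    for (int i = 0; i < 100 && !acquired; ++i)
        acquired = ts_mutex.try_lock();
    if (acquired)
        ts_mutex.unlock();

    release.set_value();  // let the worker finish
    worker.join();
    return acquired;
}
```

Calling `timestamp_thread_can_lock(true)` models the buggy pattern, where the lock attempt fails for as long as the worker stays blocked. Passing `false` models a worker that releases the mutex before making the call.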
Root Cause
- _ts_mutex critical sections were too broad:
  - They wrapped both timestamp computation (WorkgenTimeStamp::get_timestamp[_lag]) and potentially blocking WT API calls.
  - This creates a lock-holding/blocking pattern where a stalled WT call can starve the timestamp thread.
Proposed Fix
- Narrow the scope of _ts_mutex to only cover timestamp computation:
  - Hold _ts_mutex just long enough to call WorkgenTimeStamp::get_timestamp[_lag](...).
  - Release _ts_mutex before calling any WiredTiger APIs:
    - conn->set_timestamp
    - session->prepare_transaction
    - session->commit_transaction
- Commit path changes:
  - For use_prepare_timestamp:
    - Take _ts_mutex, compute prepare_ts, release _ts_mutex.
    - Call prepare_transaction with prepare_timestamp=... (no _ts_mutex held).
    - Call commit_transaction with commit_timestamp=..., durable_timestamp=... (no _ts_mutex held).
  - For use_commit_timestamp:
    - Take _ts_mutex, compute commit_ts, release _ts_mutex.
    - Call commit_transaction with commit_timestamp=... (no _ts_mutex held).
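The commit path changes listed above can be sketched as follows. This is not the actual Workgen patch: `TsState`, `WtCall`, `locked_timestamp`, `commit_with_prepare`, and `commit_only` are hypothetical names, and the WiredTiger calls are replaced by callbacks that receive the config string. The sketch also assumes that, in the prepare path, the commit and durable timestamps come from a second short critical section; the ticket does not say how they are derived.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <mutex>
#include <string>

// Hypothetical stand-in for Workgen's timestamp state; not the real class.
struct TsState {
    std::mutex ts_mutex;
    uint64_t next_ts = 1;
    // Stand-in for WorkgenTimeStamp::get_timestamp; call with ts_mutex held.
    uint64_t get_timestamp() { return next_ts++; }
};

// Stand-in for a WiredTiger call: takes a config string, returns 0 on success.
using WtCall = std::function<int(const std::string &)>;

// WiredTiger config strings carry timestamps as hexadecimal values.
static std::string ts_config(const char *key, uint64_t ts) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%s=%llx", key,
                  static_cast<unsigned long long>(ts));
    return std::string(buf);
}

// The only place ts_mutex is taken: held just for the computation.
static uint64_t locked_timestamp(TsState &st) {
    std::lock_guard<std::mutex> guard(st.ts_mutex);
    return st.get_timestamp();
}

// use_prepare_timestamp path: no lock is held across prepare or commit.
int commit_with_prepare(TsState &st, const WtCall &prepare, const WtCall &commit) {
    const uint64_t prepare_ts = locked_timestamp(st);
    int ret = prepare(ts_config("prepare_timestamp", prepare_ts));
    if (ret != 0)
        return ret;

    // Assumption: commit/durable timestamps use a second short critical section.
    const uint64_t commit_ts = locked_timestamp(st);
    assert(commit_ts >= prepare_ts);
    return commit(ts_config("commit_timestamp", commit_ts) + "," +
                  ts_config("durable_timestamp", commit_ts));
}

// use_commit_timestamp path: compute under the lock, commit without it.
int commit_only(TsState &st, const WtCall &commit) {
    const uint64_t commit_ts = locked_timestamp(st);
    return commit(ts_config("commit_timestamp", commit_ts));
}
```

Routing every lock acquisition through one small helper makes the narrow scope easy to audit: a blocking call can never sit inside the critical section, because the guard is destroyed before the helper returns.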
- is related to
  - WT-16817 Establish baseline metrics for PALite using workgen (In Progress)
- related to
  - SERVER-107597 Documentation Updates (Blocked)