- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Test wtperf
- None
- Storage Engines, Storage Engines - Transactions
- SE Transactions - 2026-03-13
- 3
While investigating cache-stuck behaviour (WT-16817) in disaggregated storage workloads, I found that Workgen’s timestamp advancement logic can stop too early. Currently, the loop in WorkloadRunner::increment_timestamp is tied to the global stopping flag, which may be set before all worker threads have actually finished. As a result, the timestamp thread can exit while the workload is still running, and the stable/oldest timestamps stop moving even though application activity continues.
This behaviour is an artefact of the harness and can lead to misleading “stuck timestamp” or “cache-stuck” scenarios that are not caused by WiredTiger itself.
Problem
- increment_timestamp runs in a dedicated thread, but its loop is controlled by the global stopping variable.
- stopping is flipped as part of the overall shutdown sequence, even if some worker threads are still active (e.g., stalled under cache pressure, slow ops, etc.).
- Once stopping is set, the timestamp thread exits, so:
  - Stable/oldest timestamps stop advancing.
  - The workload may still be doing useful work, but appears to be running under frozen timestamps.
- This diverges from MongoDB’s behaviour, where timestamp advancement is a background responsibility and not tightly coupled to any particular application thread's lifetime.
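The coupling described above can be reproduced in isolation. The following is a minimal sketch, not the Workgen source: CoupledRunner, clock, and frozen_gap are hypothetical names, and the "straggling worker" is simulated on the calling thread so the result is deterministic. Only the stopping flag and the increment_timestamp name come from this ticket.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for the runner; only `stopping` and
// `increment_timestamp` are names taken from the ticket.
struct CoupledRunner {
    std::atomic<bool> stopping{false};   // global shutdown flag
    std::atomic<uint64_t> clock{1};      // simulated commit timestamp source
    std::atomic<uint64_t> stable{0};     // simulated stable timestamp

    // Timestamp loop tied to the global flag, as in the reported behaviour.
    void increment_timestamp() {
        while (!stopping.load()) {
            stable.store(clock.load());
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }
};

// Returns how far the clock ran ahead of the stable timestamp after the
// timestamp thread exited.
uint64_t frozen_gap() {
    CoupledRunner r;
    std::thread ts(&CoupledRunner::increment_timestamp, &r);

    r.stopping.store(true);  // shutdown begins while work is still pending
    ts.join();               // the timestamp thread is already gone

    // A straggling worker keeps committing; nothing moves `stable` any more.
    for (int i = 0; i < 50; i++)
        r.clock.fetch_add(1);

    return r.clock.load() - r.stable.load();
}
```

Calling frozen_gap() shows the gap growing by every operation the straggler performs after stopping is set, which is the "frozen timestamps" symptom.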
Proposed Change
Introduce a dedicated control flag for the timestamp thread (e.g. stop_timestamp_thread) and treat timestamp advancement as an independent background service:
- The timestamp thread:
  - Runs independently in the background.
  - Periodically computes and sets stable/oldest timestamps based on configured lags.
  - Ignores the stopping lifecycle of worker threads.
- Worker threads:
  - Can stall, exit early, or be under cache pressure without affecting timestamp advancement.
- Shutdown:
  - Only when the workload is truly finished do we explicitly flip stop_timestamp_thread and join the timestamp thread.
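A rough sketch of the proposed shape, for discussion only. TimestampService, clock, stable_lag, oldest_lag, advance_once, and run_demo are hypothetical names; only stopping and stop_timestamp_thread come from this ticket. The timestamps are plain in-memory counters here; a real harness would publish them through WT_CONNECTION::set_timestamp instead. The sketch assumes oldest_lag >= stable_lag so that oldest never passes stable.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical background service; not the actual Workgen classes.
struct TimestampService {
    std::atomic<bool> stopping{false};               // worker lifecycle flag
    std::atomic<bool> stop_timestamp_thread{false};  // dedicated flag
    std::atomic<uint64_t> clock{100};                // simulated "now"
    std::atomic<uint64_t> stable{0};
    std::atomic<uint64_t> oldest{0};
    uint64_t stable_lag = 10;  // assumed configuration values
    uint64_t oldest_lag = 20;

    // Compute both timestamps from the configured lags.
    void advance_once() {
        uint64_t now = clock.load();
        uint64_t s = now > stable_lag ? now - stable_lag : 0;
        uint64_t o = now > oldest_lag ? now - oldest_lag : 0;
        assert(o <= s);  // oldest must not pass stable
        stable.store(s);
        oldest.store(o);
    }

    // The loop deliberately never reads `stopping`.
    void run() {
        while (!stop_timestamp_thread.load()) {
            advance_once();
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }
};

// Returns true if timestamps kept advancing after `stopping` was set.
bool run_demo() {
    TimestampService svc;
    std::thread ts(&TimestampService::run, &svc);

    svc.stopping.store(true);  // workers are told to stop...
    svc.clock.store(500);      // ...but a straggler still moves time forward

    // Bounded wait for the service to pick up the new clock value.
    bool advanced = false;
    for (int i = 0; i < 5000 && !advanced; i++) {
        advanced = (svc.stable.load() == 490);
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }

    // Only now, with the workload truly finished, stop and join.
    svc.stop_timestamp_thread.store(true);
    ts.join();
    return advanced && svc.oldest.load() == 480;
}
```

The point of the separate flag is ordering at shutdown: workers are joined first, and stop_timestamp_thread is flipped only afterwards, so a stalled worker never runs under frozen timestamps.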
- is related to: WT-16817 Establish baseline metrics for PALite using workgen (In Progress)
- related to: SERVER-107597 Documentation Updates (Blocked)