Workgen: fix stopping‑flag concurrency bug that prematurely halts the timestamp thread


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • WT12.0.0
    • Affects Version/s: None
    • Component/s: Test wtperf
    • None
    • Storage Engines, Storage Engines - Transactions
    • SE Transactions - 2026-03-13
    • 3

      While investigating cache-stuck behaviour (WT-16817) in disaggregated storage workloads, I found that Workgen’s timestamp advancement logic can stop too early. Currently, WorkloadRunner::increment_timestamp is tied to the global stopping flag, which may be set before all worker threads have actually finished. As a result, the timestamp thread can exit while the workload is still running, and the stable/oldest timestamps stop moving even though application activity continues.

      This behaviour is an artefact of the harness and can lead to misleading “stuck timestamp” or “cache-stuck” scenarios that are not caused by WiredTiger itself.

      Problem

      • increment_timestamp runs in a dedicated thread, but its loop is controlled by the global stopping variable.
      • stopping is flipped as part of the overall shutdown sequence, even if some worker threads are still active (for example, stalled under cache pressure or finishing slow operations).
      • Once stopping is set, the timestamp thread exits, so:
        • Stable/oldest timestamps stop advancing.
        • The workload may still be doing useful work, but appears to be running under frozen timestamps.
      • This diverges from MongoDB’s behaviour, where timestamp advancement is a background responsibility and not tightly coupled to any particular application thread's lifetime.
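
The coupling described above can be illustrated with a small standalone model. This is a sketch only, not Workgen code: run_coupled_model, CoupledResult, stall_over and the counters are hypothetical names, and a plain counter stands in for the stable/oldest timestamps. The stalled worker is simulated by a thread that waits on a flag before completing its remaining operations.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical model of the problem; none of these names exist in Workgen.
struct CoupledResult {
    std::uint64_t timestamp_at_exit; // counter value when the timestamp thread exited
    std::uint64_t timestamp_at_end;  // counter value after the worker finished
    std::uint64_t ops_after_exit;    // worker ops completed under the frozen counter
};

CoupledResult run_coupled_model(std::uint64_t remaining_worker_ops) {
    std::atomic<bool> stopping{false};   // shared flag, as in the coupled design
    std::atomic<bool> stall_over{false}; // simulates a worker stalled, e.g. under cache pressure
    std::atomic<std::uint64_t> timestamp{0};
    std::atomic<std::uint64_t> ops{0};

    // Timestamp loop tied to the shared flag: it exits as soon as stopping is set.
    std::thread ts_thread([&] {
        while (!stopping.load()) {
            timestamp.fetch_add(1);
            std::this_thread::yield();
        }
    });

    // Worker that is still active when shutdown begins.
    std::thread worker([&] {
        while (!stall_over.load())
            std::this_thread::yield();
        for (std::uint64_t i = 0; i < remaining_worker_ops; ++i)
            ops.fetch_add(1);
    });

    stopping.store(true); // shutdown sequence flips the flag early
    ts_thread.join();     // timestamp thread is gone from here on
    const std::uint64_t at_exit = timestamp.load();

    stall_over.store(true); // the worker resumes and keeps doing work
    worker.join();
    assert(ops.load() == remaining_worker_ops);

    return {at_exit, timestamp.load(), ops.load()};
}
```

In this model every one of the worker's remaining operations runs after the counter has stopped moving, which is the frozen-timestamp situation described in the list above.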

      Proposed Change

      Introduce a dedicated control flag for the timestamp thread (e.g. stop_timestamp_thread) and treat timestamp advancement as an independent background service:

      • The timestamp thread:
        • Runs independently in the background.
        • Periodically computes and sets stable/oldest timestamps based on configured lags.
        • Ignores the stopping flag that governs the worker threads' lifecycle.
      • Worker threads:
        • Can stall, exit early, or be under cache pressure without affecting timestamp advancement.
      • Shutdown:
        • Only when the workload is truly finished do we explicitly flip stop_timestamp_thread and join the timestamp thread.
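
A minimal sketch of the proposed separation, under stated assumptions: TimestampService, advances() and keeps_advancing_after_worker_stop are hypothetical names used for illustration, and the counter increment stands in for computing stable/oldest timestamps from the configured lags. Only the flag name stop_timestamp_thread comes from the proposal; the actual change to WorkloadRunner may look different.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for the timestamp thread; not the actual Workgen code.
class TimestampService {
public:
    explicit TimestampService(std::chrono::milliseconds period) : period_(period) {}
    ~TimestampService() { stop(); }

    void start() {
        assert(!thread_.joinable() && "start() called twice");
        thread_ = std::thread([this] { run(); });
    }

    // Called only once the workload is truly finished: flip the dedicated flag, then join.
    void stop() {
        stop_timestamp_thread_.store(true);
        if (thread_.joinable())
            thread_.join();
    }

    std::uint64_t advances() const { return advances_.load(); }

private:
    void run() {
        // The loop consults only its own flag, never the workers' stopping flag.
        while (!stop_timestamp_thread_.load()) {
            // A real harness would compute and set stable/oldest timestamps here.
            advances_.fetch_add(1);
            std::this_thread::sleep_for(period_);
        }
    }

    std::chrono::milliseconds period_;
    std::atomic<bool> stop_timestamp_thread_{false};
    std::atomic<std::uint64_t> advances_{0};
    std::thread thread_;
};

// Returns true if the service keeps advancing after the workers' flag is set,
// and stops advancing only after its own stop() is called.
bool keeps_advancing_after_worker_stop() {
    std::atomic<bool> stopping{false}; // worker-side flag, unrelated to the service
    TimestampService ts(std::chrono::milliseconds(1));
    ts.start();

    stopping.store(true); // workers are told to stop; the service must not notice
    const std::uint64_t before = ts.advances();
    while (ts.advances() <= before)
        std::this_thread::sleep_for(std::chrono::milliseconds(1));

    ts.stop();
    const std::uint64_t at_stop = ts.advances();
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
    return at_stop > before && ts.advances() == at_stop;
}
```

Keeping the flag private to the service means no other part of the shutdown sequence can end the loop by accident; the only way to stop it is the explicit stop() call followed by the join.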

            Assignee:
            Ravi Giri
            Reporter:
            Ravi Giri
