Exclude checkpoint workers from waiting on the cache

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Checkpoints

      An automated test of parallel checkpoints failed with an error saying that sweep had not run for 60 minutes. The test was cppsuite-reverse-split-stress.

      The callstacks indicate that checkpoint worker threads are trying to help make space available in the cache:

      [2026/03/13 16:56:30.316] Thread 37 (Thread 0x7fab653b7700 (LWP 3695) "checkpoint-p 1"):
      [2026/03/13 16:56:30.316] #0  0x00007fab659537d1 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
      [2026/03/13 16:56:30.316] #1  0x00007fab65be0235 in __wt_cond_wait_signal (session=session@entry=0x71ab7f800fa0, cond=0x71ab7fe10980, usecs=usecs@entry=1000000, run_func=run_func@entry=0x7fab65aaf69a <__checkpoint_parallel_thread_chk>, signalled=signalled@entry=0x7fab653b4197) at /data/mci/2b4ceab7edf20c28b61c5059d0097787/wiredtiger/src/os_posix/os_mtx_cond.c:115
      [2026/03/13 16:56:30.316] #2  0x00007fab65ab1678 in __checkpoint_parallel_thread_run (session=0x71ab7f800fa0, thread=<optimized out>) at /data/mci/2b4ceab7edf20c28b61c5059d0097787/wiredtiger/src/checkpoint/checkpoint_parallel.c:212
      [2026/03/13 16:56:30.316] #3  0x00007fab65ca0d80 in __thread_run (arg=0x71ab7fe0c1e0) at /data/mci/2b4ceab7edf20c28b61c5059d0097787/wiredtiger/src/support/thread_group.c:32
      [2026/03/13 16:56:30.316] #4  0x00007fab6594c609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
      [2026/03/13 16:56:30.316] #5  0x00007fab656f5353 in clone () from /lib/x86_64-linux-gnu/libc.so.6
      

      The parallel checkpoint worker threads should be excluded from participating in cache management when the cache is over-subscribed. Interestingly, the checkpoint-parallel workers appear to allocate only a single transaction during their lifecycle. The test must be getting cache-stuck-full before the first checkpoint is run.
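
      As a rough illustration of the proposed exclusion, the sketch below models a session flag check that keeps checkpoint workers out of cache management. Everything here is a hypothetical stand-in: the demo_session struct, the DEMO_* flags, and demo_should_help_with_cache are not WiredTiger definitions, and the real check would live wherever the engine decides that a thread must help evict.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for session flags; not WiredTiger's real values. */
#define DEMO_SESSION_CHECKPOINT 0x1u
#define DEMO_SESSION_CHECKPOINT_WORKER 0x2u

/* Hypothetical, minimal session: only the flags matter for this sketch. */
struct demo_session {
    uint32_t flags;
};

/*
 * demo_should_help_with_cache --
 *     Decide whether a thread should be asked to help free cache space.
 *     cache_full models the "over-subscribed" state from the ticket.
 */
static bool
demo_should_help_with_cache(const struct demo_session *s, bool cache_full)
{
    assert(s != NULL);

    /* Nothing to do while the cache has room. */
    if (!cache_full)
        return (false);

    /* Proposed change: checkpoint workers never wait on or help the cache. */
    if ((s->flags & DEMO_SESSION_CHECKPOINT_WORKER) != 0)
        return (false);

    return (true);
}
```

      With this shape, a session carrying the worker flag is skipped even when the cache is full, while an ordinary session is still asked to help.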

      This ticket should also consider whether the checkpoint worker thread should close/reopen the transaction more regularly.

      The current code in __checkpoint_parallel_thread_run looks like:

      /* Begin a transaction, if we don't already have one. */
      if (!F_ISSET(session->txn, WT_TXN_RUNNING)) {
          WT_ERR(__wt_txn_begin(session, NULL));
          F_SET(session, WT_SESSION_CHECKPOINT);
          F_SET(session, WT_SESSION_CHECKPOINT_WORKER);
      }

      /* Set up the transaction for the given entry. */
      __wt_txn_import_snapshot(session, entry->snapshot);
      

      Note that it updates the snapshot every time, but I can't see where it ever closes and re-opens the transaction. That might be OK (if this transaction never has an ID or anything else associated with another transaction), but it does make me a bit nervous.
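
      One possible shape for closing and re-opening the transaction more regularly is sketched below, assuming the worker recycles its transaction after a fixed number of work entries. The demo_txn struct, DEMO_TXN_MAX_ENTRIES, and demo_txn_process_entry are hypothetical names for illustration only; the threshold of 4 is arbitrary, and the real code would call the engine's own begin and release routines instead of toggling a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a worker's transaction state; not WiredTiger's WT_TXN. */
struct demo_txn {
    bool running;            /* a transaction is currently open */
    unsigned begin_count;    /* how many transactions have been started */
    unsigned entries_in_txn; /* entries processed in the current transaction */
};

/* Arbitrary illustrative limit on entries handled per transaction. */
#define DEMO_TXN_MAX_ENTRIES 4u

/*
 * demo_txn_process_entry --
 *     Process one work entry, opening a transaction on demand and closing it
 *     once the per-transaction entry limit is reached.
 */
static void
demo_txn_process_entry(struct demo_txn *txn)
{
    assert(txn != NULL);

    /* Begin a transaction, if we don't already have one. */
    if (!txn->running) {
        txn->running = true;
        txn->entries_in_txn = 0;
        ++txn->begin_count;
    }

    /* The per-entry work (importing the snapshot, and so on) would go here. */
    ++txn->entries_in_txn;

    /* Close the transaction so the next entry starts a fresh one. */
    if (txn->entries_in_txn >= DEMO_TXN_MAX_ENTRIES)
        txn->running = false;
}
```

      Processing nine entries with this sketch starts three transactions rather than one, so no single transaction spans the worker's whole lifecycle.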

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Alexander Gorrod