- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Checkpoints
- Storage Engines - Persistence
An automated test of parallel checkpoints failed with an error saying that sweep had not run for 60 minutes. The test was cppsuite-reverse-split-stress.
The callstacks indicate that checkpoint worker threads are trying to help make space available in the cache:
[2026/03/13 16:56:30.316] Thread 37 (Thread 0x7fab653b7700 (LWP 3695) "checkpoint-p 1"):
[2026/03/13 16:56:30.316] #0 0x00007fab659537d1 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
[2026/03/13 16:56:30.316] #1 0x00007fab65be0235 in __wt_cond_wait_signal (session=session@entry=0x71ab7f800fa0, cond=0x71ab7fe10980, usecs=usecs@entry=1000000, run_func=run_func@entry=0x7fab65aaf69a <__checkpoint_parallel_thread_chk>, signalled=signalled@entry=0x7fab653b4197) at /data/mci/2b4ceab7edf20c28b61c5059d0097787/wiredtiger/src/os_posix/os_mtx_cond.c:115
[2026/03/13 16:56:30.316] #2 0x00007fab65ab1678 in __checkpoint_parallel_thread_run (session=0x71ab7f800fa0, thread=<optimized out>) at /data/mci/2b4ceab7edf20c28b61c5059d0097787/wiredtiger/src/checkpoint/checkpoint_parallel.c:212
[2026/03/13 16:56:30.316] #3 0x00007fab65ca0d80 in __thread_run (arg=0x71ab7fe0c1e0) at /data/mci/2b4ceab7edf20c28b61c5059d0097787/wiredtiger/src/support/thread_group.c:32
[2026/03/13 16:56:30.316] #4 0x00007fab6594c609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
[2026/03/13 16:56:30.316] #5 0x00007fab656f5353 in clone () from /lib/x86_64-linux-gnu/libc.so.6
The parallel checkpoint worker threads should be excluded from participating in cache management when the cache is over-subscribed. Interestingly, the checkpoint-parallel workers appear to allocate only a single transaction during their lifecycle. The test must be getting cache-stuck-full before the first checkpoint is run.
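To make the proposed exclusion concrete, here is a minimal standalone sketch of the kind of check being suggested. It is not WiredTiger code: `toy_session`, `TOY_SESSION_CHECKPOINT_WORKER`, `toy_should_help_evict` and the percentage parameters are all hypothetical names invented for illustration, and the real decision point in the eviction path may look quite different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag and session type, for illustration only (not WiredTiger's). */
#define TOY_SESSION_CHECKPOINT_WORKER 0x1u

typedef struct {
    uint32_t flags;
} toy_session;

/*
 * Return true if this session should be asked to help make space in the cache.
 * Checkpoint worker sessions are always excluded, even when the cache is over
 * its trigger level.
 */
static bool
toy_should_help_evict(const toy_session *session, uint32_t cache_used_pct, uint32_t trigger_pct)
{
    assert(trigger_pct <= 100);

    if ((session->flags & TOY_SESSION_CHECKPOINT_WORKER) != 0)
        return false;
    return cache_used_pct >= trigger_pct;
}
```

With this shape, an ordinary session is pulled into eviction once usage crosses the trigger, while a session flagged as a checkpoint worker never is, which is the behaviour this ticket asks for.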
This ticket should also consider whether the checkpoint worker thread should close/reopen the transaction more regularly.
The current code in __checkpoint_parallel_thread_run looks like:
/* Begin a transaction, if we don't already have one. */
if (!F_ISSET(session->txn, WT_TXN_RUNNING)) {
    WT_ERR(__wt_txn_begin(session, NULL));
    F_SET(session, WT_SESSION_CHECKPOINT);
    F_SET(session, WT_SESSION_CHECKPOINT_WORKER);
}

/* Set up the transaction for the given entry. */
__wt_txn_import_snapshot(session, entry->snapshot);
Note that it updates the snapshot every time, but I can't see where it ever closes and re-opens the transaction. That might be OK (if this transaction never has an ID or anything else associated with another transaction), but it does make me a bit nervous.
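One possible shape for closing and re-opening the transaction more regularly is sketched below as a standalone toy model. Everything in it is an assumption made for illustration: `toy_worker`, `toy_txn_begin`, `toy_txn_release`, `toy_worker_process` and the refresh interval of 4 entries are hypothetical and do not correspond to WiredTiger functions or tuning values. The real fix would need to use whatever transaction release path is appropriate for a checkpoint worker session.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: how many work entries one transaction may serve before it is recycled. */
#define TOY_TXN_REFRESH_INTERVAL 4

/* Hypothetical stand-in for a checkpoint worker's session state. */
typedef struct {
    bool txn_running;        /* models the "transaction is running" flag */
    uint64_t snapshot;       /* models the snapshot imported for the current entry */
    uint32_t entries_in_txn; /* entries handled by the current transaction */
    uint32_t txn_begins;     /* transactions started over the worker's lifetime */
} toy_worker;

static void
toy_txn_begin(toy_worker *w)
{
    assert(!w->txn_running);
    w->txn_running = true;
    w->entries_in_txn = 0;
    ++w->txn_begins;
}

static void
toy_txn_release(toy_worker *w)
{
    assert(w->txn_running);
    w->txn_running = false;
    w->snapshot = 0;
}

/* Handle one work entry, recycling the transaction every TOY_TXN_REFRESH_INTERVAL entries. */
static void
toy_worker_process(toy_worker *w, uint64_t entry_snapshot)
{
    /* Close a transaction that has already served its quota of entries. */
    if (w->txn_running && w->entries_in_txn >= TOY_TXN_REFRESH_INTERVAL)
        toy_txn_release(w);

    /* Begin a transaction, if we don't already have one. */
    if (!w->txn_running)
        toy_txn_begin(w);

    /* Set up the transaction for the given entry. */
    w->snapshot = entry_snapshot;
    ++w->entries_in_txn;
}
```

Processing nine entries through this model starts three transactions instead of one, so no single transaction stays open for the worker's whole lifetime. Whether a count-based interval, a per-checkpoint boundary, or a release after every entry is the right policy is left open here.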