Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: WT12.0.0
Affects Version/s: None
Component/s: Checkpoints
Labels: None
Assigned Teams: Storage Engines - Persistence
Total Hours with Assigned Team: 1,122.851
Epic Link: Implement parallel checkpoint
Sprint: SE Persistence - 2026-03-27
Story Points: None

An automated test of parallel checkpoints, cppsuite-reverse-split-stress, failed with an error reporting that sweep had not run for 60 minutes.

The callstacks indicate that checkpoint worker threads are trying to help make space available in the cache:

[2026/03/13 16:56:30.316] Thread 37 (Thread 0x7fab653b7700 (LWP 3695) "checkpoint-p 1"):
[2026/03/13 16:56:30.316] #0  0x00007fab659537d1 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
[2026/03/13 16:56:30.316] #1  0x00007fab65be0235 in __wt_cond_wait_signal (session=session@entry=0x71ab7f800fa0, cond=0x71ab7fe10980, usecs=usecs@entry=1000000, run_func=run_func@entry=0x7fab65aaf69a <__checkpoint_parallel_thread_chk>, signalled=signalled@entry=0x7fab653b4197) at /data/mci/2b4ceab7edf20c28b61c5059d0097787/wiredtiger/src/os_posix/os_mtx_cond.c:115
[2026/03/13 16:56:30.316] #2  0x00007fab65ab1678 in __checkpoint_parallel_thread_run (session=0x71ab7f800fa0, thread=<optimized out>) at /data/mci/2b4ceab7edf20c28b61c5059d0097787/wiredtiger/src/checkpoint/checkpoint_parallel.c:212
[2026/03/13 16:56:30.316] #3  0x00007fab65ca0d80 in __thread_run (arg=0x71ab7fe0c1e0) at /data/mci/2b4ceab7edf20c28b61c5059d0097787/wiredtiger/src/support/thread_group.c:32
[2026/03/13 16:56:30.316] #4  0x00007fab6594c609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
[2026/03/13 16:56:30.316] #5  0x00007fab656f5353 in clone () from /lib/x86_64-linux-gnu/libc.so.6

The parallel checkpoint worker threads should be excluded from participating in cache management when the cache is over-subscribed. Interestingly, the checkpoint-parallel workers appear to allocate only a single transaction during their lifecycle. The test must be getting cache-stuck-full before the first checkpoint is run.
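To make the proposed exclusion concrete, here is a minimal standalone sketch of the kind of check I have in mind. It is not WiredTiger code: the `sketch_session` type, the `SKETCH_SESSION_CHECKPOINT_WORKER` flag and `sketch_session_can_help_evict` are hypothetical names invented for illustration, and the real decision point in the eviction path may look quite different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag for this sketch; not WiredTiger's real definition. */
#define SKETCH_SESSION_CHECKPOINT_WORKER 0x1u

/* Hypothetical, heavily reduced stand-in for a session handle. */
typedef struct {
    uint32_t flags;
} sketch_session;

/*
 * sketch_session_can_help_evict --
 *     Decide whether a session may be drafted into making space in the cache.
 *     Checkpoint worker sessions are always excluded, even when the cache is
 *     over its limit.
 */
static bool
sketch_session_can_help_evict(const sketch_session *session, bool cache_over_limit)
{
    assert(session != NULL);

    /* Nobody needs to help while the cache is within its limit. */
    if (!cache_over_limit)
        return (false);

    /* Checkpoint workers never participate in cache management. */
    if ((session->flags & SKETCH_SESSION_CHECKPOINT_WORKER) != 0)
        return (false);

    return (true);
}
```

With this shape, an ordinary session is still asked to help when the cache is over its limit, while a session carrying the worker flag never is.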

This ticket should also consider whether the checkpoint worker thread should close/reopen the transaction more regularly.

The current code in __checkpoint_parallel_thread_run looks like:

    /* Begin a transaction, if we don't already have one. */
    if (!F_ISSET(session->txn, WT_TXN_RUNNING)) {
        WT_ERR(__wt_txn_begin(session, NULL));
        F_SET(session, WT_SESSION_CHECKPOINT);
        F_SET(session, WT_SESSION_CHECKPOINT_WORKER);
    }

    /* Set up the transaction for the given entry. */
    __wt_txn_import_snapshot(session, entry->snapshot);

Note that it updates the snapshot every time, but I can't see where it ever closes and re-opens the transaction. That might be OK (if this transaction never has an ID or anything else associated with another transaction), but it does make me a bit nervous.
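One possible shape for "close/reopen more regularly" is to release the worker's transaction after a bounded number of entries, so the next entry begins a fresh one. The following is a standalone model of that policy only, under my own assumptions: `sketch_worker`, `sketch_txn_begin`, `sketch_txn_release`, `sketch_worker_process_entry` and the `max_entries_per_txn` parameter are all hypothetical and do not exist in the source tree. Whether a count-based bound, a per-checkpoint bound, or something else is appropriate is an open question for this ticket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a checkpoint worker's transaction state. */
typedef struct {
    bool txn_running;       /* Stand-in for the "transaction running" flag. */
    size_t entries_in_txn;  /* Entries handled under the current transaction. */
    size_t txn_begin_count; /* How many transactions this worker has begun. */
} sketch_worker;

/* Begin a new (modelled) transaction. */
static void
sketch_txn_begin(sketch_worker *worker)
{
    worker->txn_running = true;
    worker->entries_in_txn = 0;
    worker->txn_begin_count++;
}

/* Release the current (modelled) transaction. */
static void
sketch_txn_release(sketch_worker *worker)
{
    worker->txn_running = false;
}

/*
 * sketch_worker_process_entry --
 *     Handle one work entry, releasing the transaction once it has been used
 *     for max_entries_per_txn entries so the next entry starts a new one.
 */
static void
sketch_worker_process_entry(sketch_worker *worker, size_t max_entries_per_txn)
{
    assert(worker != NULL && max_entries_per_txn > 0);

    /* Begin a transaction, if we don't already have one. */
    if (!worker->txn_running)
        sketch_txn_begin(worker);

    /* Importing the entry's snapshot and doing the work would happen here. */

    if (++worker->entries_in_txn >= max_entries_per_txn)
        sketch_txn_release(worker);
}
```

In this model, a very large `max_entries_per_txn` reproduces the behavior observed above (one transaction for the worker's whole lifecycle), while a small bound forces regular reopening.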

Assignee: Peter Macko
Reporter: Alexander Gorrod
Votes: 0
Watchers: 3

Created: Mar 14 2026 11:42:28 AM UTC
Updated: Mar 20 2026 04:12:26 PM UTC
Resolved: Mar 20 2026 04:12:26 PM UTC
