Handle checkpoint thread errors gracefully

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: WT12.0.0
    • Affects Version/s: None
    • Component/s: Checkpoints
    • None

      A test case that exercises a slow-locking implementation hits a situation where the coordination between the checkpoint server and its workers does not seem right.
      The threads are:

      [2026/03/13 16:03:58.308]   Id   Target Id                                            Frame
      [2026/03/13 16:03:58.308] * 1    Thread 0xffff9c472040 (LWP 133466) "python3"         0x0000ffff9acbefb4 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
      [2026/03/13 16:03:58.308]   2    Thread 0xffff96322b80 (LWP 133715) "log-wrlsn-serve" 0x0000ffff9acbefb4 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
      [2026/03/13 16:03:58.308]   3    Thread 0xffff96b32b80 (LWP 133714) "log-close-serve" 0x0000ffff9acbefb4 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
      [2026/03/13 16:03:58.308]   4    Thread 0xffff97342b80 (LWP 133615) "tiered-server"   0x0000ffff9acbefb4 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
      [2026/03/13 16:03:58.308]   5    Thread 0xffff98362b80 (LWP 133613) "checkpoint-p 4"  0x0000ffff9acbefb4 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
      [2026/03/13 16:03:58.308]   6    Thread 0xffff98b72b80 (LWP 133612) "checkpoint-p 3"  0x0000ffff9acbefb4 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
      [2026/03/13 16:03:58.308]   7    Thread 0xffff99382b80 (LWP 133611) "checkpoint-p 2"  0x0000ffff9acbefb4 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
      

      The thread list shows three checkpoint worker threads, and all of them are idle:

      [2026/03/13 16:03:58.357] Thread 7 (Thread 0xffff99382b80 (LWP 133611) "checkpoint-p 2"):
      [2026/03/13 16:03:58.357] #0  0x0000ffff9acbefb4 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
      [2026/03/13 16:03:58.357] #1  0x0000ffff9acc1e78 [PAC] in pthread_cond_timedwait@@GLIBC_2.17 () from /lib64/libc.so.6
      [2026/03/13 16:03:58.357] #2  0x0000ffff9a20093c [PAC] in __wt_cond_wait_signal (session=session@entry=0x5166ff663638, cond=0x5166ffe1e990, usecs=1000000, run_func=run_func@entry=0xffff9a0fbc00 <__checkpoint_parallel_thread_chk>, signalled=signalled@entry=0xffff9938223f) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/os_posix/os_mtx_cond.c:115
      [2026/03/13 16:03:58.357] #3  0x0000ffff9a0fc7f4 in __checkpoint_parallel_thread_run (session=0x5166ff663638, thread=<optimized out>) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/checkpoint/checkpoint_parallel.c:212
      [2026/03/13 16:03:58.357] #4  0x0000ffff9a29ef98 in __thread_run (arg=0x5166ffe1c960) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/support/thread_group.c:32
      [2026/03/13 16:03:58.357] #5  0x0000ffff9acc2834 in start_thread () from /lib64/libc.so.6
      [2026/03/13 16:03:58.357] #6  0x0000ffff9ac66e5c [PAC] in thread_start () from /lib64/libc.so.6
      [2026/03/13 16:03:58.357] Thread 6 (Thread 0xffff98b72b80 (LWP 133612) "checkpoint-p 3"):
      [2026/03/13 16:03:58.357] #0  0x0000ffff9acbefb4 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
      [2026/03/13 16:03:58.357] #1  0x0000ffff9acc1e78 [PAC] in pthread_cond_timedwait@@GLIBC_2.17 () from /lib64/libc.so.6
      [2026/03/13 16:03:58.357] #2  0x0000ffff9a20093c [PAC] in __wt_cond_wait_signal (session=session@entry=0x5166ff663da0, cond=0x5166ffe1e990, usecs=1000000, run_func=run_func@entry=0xffff9a0fbc00 <__checkpoint_parallel_thread_chk>, signalled=signalled@entry=0xffff98b7223f) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/os_posix/os_mtx_cond.c:115
      [2026/03/13 16:03:58.357] #3  0x0000ffff9a0fc7f4 in __checkpoint_parallel_thread_run (session=0x5166ff663da0, thread=<optimized out>) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/checkpoint/checkpoint_parallel.c:212
      [2026/03/13 16:03:58.357] #4  0x0000ffff9a29ef98 in __thread_run (arg=0x5166ffe1c9b0) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/support/thread_group.c:32
      [2026/03/13 16:03:58.357] #5  0x0000ffff9acc2834 in start_thread () from /lib64/libc.so.6
      [2026/03/13 16:03:58.357] #6  0x0000ffff9ac66e5c [PAC] in thread_start () from /lib64/libc.so.6
      [2026/03/13 16:03:58.357] Thread 5 (Thread 0xffff98362b80 (LWP 133613) "checkpoint-p 4"):
      [2026/03/13 16:03:58.357] #0  0x0000ffff9acbefb4 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
      [2026/03/13 16:03:58.357] #1  0x0000ffff9acc1e78 [PAC] in pthread_cond_timedwait@@GLIBC_2.17 () from /lib64/libc.so.6
      [2026/03/13 16:03:58.357] #2  0x0000ffff9a20093c [PAC] in __wt_cond_wait_signal (session=session@entry=0x5166ff664508, cond=0x5166ffe1e990, usecs=1000000, run_func=run_func@entry=0xffff9a0fbc00 <__checkpoint_parallel_thread_chk>, signalled=signalled@entry=0xffff9836223f) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/os_posix/os_mtx_cond.c:115
      [2026/03/13 16:03:58.357] #3  0x0000ffff9a0fc7f4 in __checkpoint_parallel_thread_run (session=0x5166ff664508, thread=<optimized out>) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/checkpoint/checkpoint_parallel.c:212
      [2026/03/13 16:03:58.357] #4  0x0000ffff9a29ef98 in __thread_run (arg=0x5166ffe1ca00) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/support/thread_group.c:32
      [2026/03/13 16:03:58.357] #5  0x0000ffff9acc2834 in start_thread () from /lib64/libc.so.6
      [2026/03/13 16:03:58.357] #6  0x0000ffff9ac66e5c [PAC] in thread_start () from /lib64/libc.so.6
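
      The idle state in these stacks is a timed condition wait with a run-check callback (usecs=1000000, run_func=__checkpoint_parallel_thread_chk). The Python sketch below illustrates that general pattern only; it is not WiredTiger code, and the names IdleWorker, push and stop are hypothetical. A worker parked this way wakes when it is notified or when the one-second timeout expires, rechecks for work, and goes back to waiting if there is none.

```python
import threading
from collections import deque


class IdleWorker:
    """Hypothetical worker that parks in a timed condition wait when idle."""

    def __init__(self):
        self.cond = threading.Condition()
        self.queue = deque()
        self.running = True
        self.processed = []

    def _should_wake(self):
        # Analogous in spirit to a run-check callback: wake if there is
        # queued work or if shutdown has been requested.
        return bool(self.queue) or not self.running

    def run(self):
        while True:
            with self.cond:
                # Wait up to one second; returns early once notified and
                # the predicate holds.
                self.cond.wait_for(self._should_wake, timeout=1.0)
                if self.queue:
                    item = self.queue.popleft()
                elif not self.running:
                    return
                else:
                    continue  # timed out with nothing to do: idle again
            self.processed.append(item)

    def push(self, item):
        with self.cond:
            self.queue.append(item)
            self.cond.notify()

    def stop(self):
        with self.cond:
            self.running = False
            self.cond.notify_all()
```

      In this sketch a worker that is never handed an item, or that has already drained its queue, looks exactly like the threads above: blocked in the timed wait with nothing to signal.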
      

      At the same time, a thread doing connection close is waiting on a semaphore (which presumably the workers should signal):

      [2026/03/13 16:03:59.145] Thread 1 (Thread 0xffff9c472040 (LWP 133466) "python3"):
      [2026/03/13 16:03:59.145] #0  0x0000ffff9acbefb4 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
      [2026/03/13 16:03:59.145] #1  0x0000ffff9accada0 [PAC] in __new_sem_wait_slow64.constprop.0 () from /lib64/libc.so.6
      [2026/03/13 16:03:59.145] #2  0x0000ffff9a2011dc [PAC] in __wt_semaphore_wait (session=session@entry=0x5166ff66acb8, sem=sem@entry=0x5166ff8f3578) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/os_posix/os_mtx_sem.c:68
      [2026/03/13 16:03:59.145] #3  0x0000ffff9a0fd188 in __wt_checkpoint_parallel_finish (session=session@entry=0x5166ff66acb8) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/checkpoint/checkpoint_parallel.c:393
      [2026/03/13 16:03:59.145] #4  0x0000ffff9a0d2984 in __wt_sync_file (session=session@entry=0x5166ff66acb8, syncop=syncop@entry=WT_SYNC_CHECKPOINT) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/btree/bt_sync.c:293
      [2026/03/13 16:03:59.145] #5  0x0000ffff9a103e04 in __checkpoint_tree (session=session@entry=0x5166ff66acb8, is_checkpoint=is_checkpoint@entry=true, cfg=0xffffea49f290) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/checkpoint/checkpoint_txn.c:2781
      [2026/03/13 16:03:59.145] #6  0x0000ffff9a105584 in __checkpoint_tree_helper (session=0x5166ff66acb8, cfg=0xffffea49f290) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/checkpoint/checkpoint_txn.c:2943
      [2026/03/13 16:03:59.145] #7  __checkpoint_apply_to_dhandles (session=0x5166ff66acb8, cfg=0xffffea49f290, op=<optimized out>) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/checkpoint/checkpoint_txn.c:338
      [2026/03/13 16:03:59.145] #8  __checkpoint_db_internal (session=0x5166ff66acb8, cfg=0xffffea49f290) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/checkpoint/checkpoint_txn.c:1544
      [2026/03/13 16:03:59.145] #9  __checkpoint_db_wrapper (session=session@entry=0x5166ff66acb8, cfg=cfg@entry=0xffffea49f290) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/checkpoint/checkpoint_txn.c:1954
      [2026/03/13 16:03:59.145] #10 0x0000ffff9a107c80 in __wt_checkpoint_db (session=0x5166ff66acb8, cfg=cfg@entry=0xffffea49f290, waiting=waiting@entry=true) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/checkpoint/checkpoint_txn.c:2035
      [2026/03/13 16:03:59.145] #11 0x0000ffff9a2abdf4 in __wt_txn_global_shutdown (session=session@entry=0x5166ff662000, cfg=cfg@entry=0xffffea49f370) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/txn/txn.c:2623
      [2026/03/13 16:03:59.145] #12 0x0000ffff9a111058 in __conn_close (wt_conn=0x5166ff8f2000, config=<optimized out>) at /data/mci/ac01179377cfde5eac92c5e038c7ad64/wiredtiger/src/conn/conn_api.c:1255
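
      One way a hang of this shape can arise, assuming (the ticket does not confirm this) that a worker hits an error and returns to its idle loop without posting the semaphore, is sketched below in Python. This is not WiredTiger code; run_parallel, work and the sentinel are hypothetical names. The point is that the worker releases the semaphore once per item in a finally clause, so the coordinator's count still balances when work fails, and the error is recorded for the coordinator instead of being lost.

```python
import queue
import threading

_STOP = object()  # hypothetical shutdown sentinel


def run_parallel(items, work, n_workers=3, timeout=2.0):
    """Run work(item) on worker threads and wait for every item to finish.

    Returns (completed, errors), where errors is a list of (item, exception).
    """
    items = list(items)
    tasks = queue.Queue()
    done = threading.Semaphore(0)
    errors = []
    errors_lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is _STOP:
                return
            try:
                work(item)
            except Exception as exc:
                # Record the failure for the coordinator.
                with errors_lock:
                    errors.append((item, exc))
            finally:
                # Always signal completion, even on error; skipping this
                # on the error path would leave the coordinator blocked.
                done.release()

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)

    completed = 0
    for _ in items:
        # Timed wait so a missing signal shows up as a short count
        # rather than an indefinite hang.
        if not done.acquire(timeout=timeout):
            break
        completed += 1

    for _ in threads:
        tasks.put(_STOP)
    for t in threads:
        t.join(timeout)
    return completed, errors
```

      With a work function that fails on some items, the coordinator still sees one signal per item and can report the collected errors after the wait completes.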
      

      Note that this failure happened after the changes in WT-16909, which addressed a similar (but possibly opposite?) issue.

            Assignee: Peter Macko
            Reporter: Alexander Gorrod