Type: Bug
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Disagg CI-blocker, Test Python
Labels: None
Assigned Teams: Storage Engines, Storage Engines - Foundations, Storage Engines - Persistence
Sprint: SE Persistence backlog
Story Points: None

test_checkpoint_snapshot02 fails when running with the disagg hook (using PALM):

python3 ../test/suite/run.py --hook disagg -v 2 checkpoint_snapshot02 -s 4

The above command runs it with scenario 4, which uses a row store, and simulates a crash.

In the test, the test_checkpoint_snapshot function:

1. Opens the connection with logging enabled, a small cache (10M), and some timing stress variables enabled
2. Creates a table and populates it with 1000 rows, each entry about 500 bytes
3. Starts a transaction and inserts the same amount, holding the txn open
4. Starts a background checkpoint thread and waits until the checkpoint has taken a snapshot (by looking at statistics)
5. Commits the txn
6. Tells the checkpoint thread to finish and waits for the thread
7. Simulates a crash by copying the directory and restarting on the copy

Running with the disagg hook (which turns table: URIs into layered: URIs), there's a failure on the restart. The stable table URI cannot be found, so opening the layered: URI for the (one) data table fails.
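
For context, the "simulate a crash by copying the directory" step can be sketched in plain Python. This is only an illustration of the idea, not the helper the WiredTiger test suite uses: `crash_copy_demo`, the file names, and the directory names are all made up for the example. It shows that a copy taken while the process is still running only captures what has already reached the files; anything still buffered in the process is missing from the copy.

```python
import os
import shutil
import tempfile


def crash_copy_demo():
    """Copy a 'live' directory and report what the copy contains.

    Hypothetical stand-in for the copy-and-restart step; no WiredTiger
    code is involved.
    """
    with tempfile.TemporaryDirectory() as root:
        home = os.path.join(root, "HOME")
        os.makedirs(home)

        # Stand-in for data that was made durable before the "crash".
        with open(os.path.join(home, "durable.bin"), "wb") as f:
            f.write(b"checkpointed")
            f.flush()
            os.fsync(f.fileno())

        # Stand-in for a write still sitting in a process-level buffer.
        pending = open(os.path.join(home, "pending.bin"), "wb")
        pending.write(b"not yet flushed")

        # The "crash": copy the directory while the writer is still open.
        restart_dir = os.path.join(root, "RESTART")
        shutil.copytree(home, restart_dir)
        pending.close()

        copied = sorted(os.listdir(restart_dir))
        pending_size = os.path.getsize(os.path.join(restart_dir, "pending.bin"))
        return copied, pending_size


if __name__ == "__main__":
    files, size = crash_copy_demo()
    print(files, "pending.bin bytes in copy:", size)
```

The copy contains both files, but the buffered write never made it into `pending.bin`. The analogous question for this ticket is whether the shared metadata entry for the new table had been made recoverable by the time the directory was copied.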

This indicates that the table creation was not persistent. In non-disagg it would be persistent because the connection is logged, so a table creation would be logged, and the log replayed on restart. In disagg, we don't do logging in WT (there is effectively no local storage). However, the table creation adds an entry to the shared metadata file, and if that gets checkpointed (as I expect it to by step 6), then we should see the URI on a restart.

Some possibilities: step 6 (checkpoint thread) is done via a special class/function in wtthread.py. Is there any possibility that the checkpoint doesn't actually complete, but the thread exits and joins the main thread? Another possibility is that PALM (using LMDB) is misconfigured to not be durable on transactions. Maybe "completing" a checkpoint requires some extra action to guarantee that all writes up to that point are recoverable. (We don't need fsync-level guarantees for our testing, though.)
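
One way to rule out the first possibility is to have the background thread record whether a checkpoint actually returned, and have the main thread check that after the join. The sketch below is a generic illustration of that pattern using only the standard library; `CheckpointWorker`, `checkpoint_fn`, and `join_and_verify` are hypothetical names and do not reflect the class in wtthread.py.

```python
import threading


class CheckpointWorker(threading.Thread):
    """Hypothetical background thread that repeatedly runs a checkpoint."""

    def __init__(self, checkpoint_fn, stop_event):
        super().__init__()
        self.checkpoint_fn = checkpoint_fn  # callable standing in for a checkpoint
        self.stop_event = stop_event        # set by the main thread to request exit
        self.completed = 0                  # checkpoints that returned normally
        self.error = None                   # first exception raised, if any

    def run(self):
        while True:
            try:
                self.checkpoint_fn()
                self.completed += 1
            except Exception as exc:
                # Remember the failure instead of letting the thread die silently.
                self.error = exc
                return
            if self.stop_event.wait(timeout=0.01):
                return


def join_and_verify(worker, stop_event):
    """Ask the worker to stop, join it, and fail loudly if no checkpoint finished."""
    stop_event.set()
    worker.join()
    if worker.error is not None:
        raise RuntimeError("checkpoint failed in background thread") from worker.error
    if worker.completed == 0:
        raise RuntimeError("thread exited without completing a checkpoint")
    return worker.completed


if __name__ == "__main__":
    stop = threading.Event()
    worker = CheckpointWorker(lambda: None, stop)
    worker.start()
    print("completed checkpoints:", join_and_verify(worker, stop))
```

Because the loop runs the checkpoint before checking the stop event, a clean exit implies at least one completed checkpoint, and an exception in the thread surfaces in the main thread rather than being lost at join time.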

I'm going to explicitly disable/skip this test for disagg with a FIXME.
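
As a rough sketch of what such a skip could look like, the example below uses only the standard `unittest` module. The `ACTIVE_HOOKS` set and the local `skip_for_hook` helper are assumptions made for illustration; the WiredTiger test suite has its own mechanism for this, which may differ.

```python
import unittest

# Hypothetical: the set of hooks the test runner was started with.
ACTIVE_HOOKS = {"disagg"}


def skip_for_hook(hook, reason):
    """Local illustrative helper: skip the test when the named hook is active."""
    return unittest.skipIf(hook in ACTIVE_HOOKS, f"{reason} (hook: {hook})")


class ExampleSnapshotTest(unittest.TestCase):
    @skip_for_hook("disagg", "FIXME: stable table URI not found after simulated crash")
    def test_checkpoint_snapshot(self):
        # The real scenario would run here; it is never reached under disagg.
        self.fail("should have been skipped")


def run_example():
    """Run the example test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleSnapshotTest)
    result = unittest.TestResult()
    suite.run(result)
    return result


if __name__ == "__main__":
    res = run_example()
    print("skipped:", [reason for _, reason in res.skipped])
```

Keeping the FIXME text in the skip reason makes the skipped test show up in test output along with the reason it was disabled.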

duplicates

WT-14984 Disagg: cannot read page after python test copies a directory (test_checkpoint_snapshot01)

Closed

is related to

WT-14984 Disagg: cannot read page after python test copies a directory (test_checkpoint_snapshot01)

Closed

related to

WT-14986 Fix checkpoint_id assertion for disagg delta reconciliation

Closed

WT-15452 failed: s-outdated-fixmes on infrequent-checks [wiredtiger @ 179a50e1]

Closed

Assignee: Yury Ershov
Reporter: Donald Anderson
Votes: 0
Watchers: 3

Created: Jul 07 2025 08:43:14 PM UTC
Updated: Sep 12 2025 03:39:42 AM UTC
Resolved: Sep 09 2025 06:04:44 AM UTC
