- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Test Python
- Storage Engines, Storage Engines - Persistence
- SE Persistence backlog
test_checkpoint_snapshot02 fails when running with the disagg hook (using PALM):
python3 ../test/suite/run.py --hook disagg -v 2 checkpoint_snapshot02 -s 4
The above command runs it with scenario 4, which uses a row store, and simulates a crash.
In the test, the test_checkpoint_snapshot function:
1. Opens a connection with logging enabled, a small cache (10M), and some timing stress variables enabled
2. Creates a table and populates it with 1000 rows, each entry about 500 bytes
3. Starts a transaction and inserts the same number of rows, holding the txn open
4. Starts a background checkpoint thread and waits until the checkpoint has taken a snapshot (by looking at statistics)
5. Commits the txn
6. Tells the checkpoint thread to finish and waits for the thread
7. Simulates a crash by copying the directory and restarting on the copy
Running with the disagg hook (which turns table: URIs into layered: URIs), there's a failure on the restart. The stable table URI cannot be found, so opening the layered: URI for the (one) data table fails.
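To make the crash step concrete, here is a minimal stdlib-only sketch of the "copy the directory and restart on the copy" idea. It does not use WiredTiger or the test suite's own helpers; the directory names, file names, and the simulate_crash function are all hypothetical. It only shows that a copy taken while writers are still open contains just what has already reached the file system, which is why anything not yet made durable is missing after the restart.

```python
import os
import shutil
import tempfile

def simulate_crash(live_dir, copy_dir):
    """Copy live_dir as-is, without closing any open handles (hypothetical helper)."""
    shutil.copytree(live_dir, copy_dir)
    return copy_dir

root = tempfile.mkdtemp()
live = os.path.join(root, "WT_HOME")  # stand-in for the live home directory
os.mkdir(live)

# "Durable" write: pushed out of the process buffer before the crash point.
durable = open(os.path.join(live, "durable.dat"), "wb")
durable.write(b"checkpointed")
durable.flush()

# "Volatile" write: small enough to still sit in the process buffer at the crash point.
volatile = open(os.path.join(live, "volatile.dat"), "wb")
volatile.write(b"not yet written out")

# Crash: take the copy while both handles are still open.
crashed = simulate_crash(live, os.path.join(root, "WT_HOME_COPY"))

with open(os.path.join(crashed, "durable.dat"), "rb") as f:
    durable_after = f.read()
with open(os.path.join(crashed, "volatile.dat"), "rb") as f:
    volatile_after = f.read()

# Clean up the scratch directories.
durable.close()
volatile.close()
shutil.rmtree(root)
```

In this sketch the file that was flushed survives the copy with its contents, while the unflushed file exists but is empty. By analogy, the table creation here appears to be in the "unflushed" category at the time of the copy.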
This indicates that the table creation was not persistent. In non-disagg it would be persistent because the connection is logged, so a table creation would be logged, and the log replayed on restart. In disagg, we don't do logging in WT (there is effectively no local storage). However, the table creation adds an entry to the shared metadata file, and if that gets checkpointed (as I expect it to by step 6), then we should see the URI on a restart.
Some possibilities: step 6 (the checkpoint thread) is done via a special class/function in wtthread.py. Is there any possibility that the checkpoint doesn't actually complete, but the thread still exits and is joined by the main thread? Another possibility is that PALM (using LMDB) is configured in a way that does not make transactions durable. Maybe "completing" a checkpoint requires some extra action to guarantee that all writes up to that point are recoverable. (We don't need fsync-level guarantees for our testing, though.)
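One way to rule out the first possibility would be to have the checkpoint thread record how many checkpoints actually returned, and any exception it hit, so the main thread can check both after the join. The sketch below is a stdlib-only illustration of that pattern, not the code in wtthread.py; the CheckpointThread class, its fields, and fake_checkpoint are hypothetical stand-ins.

```python
import threading
import time

class CheckpointThread(threading.Thread):
    """Hypothetical stand-in for a background checkpoint thread."""

    def __init__(self, do_checkpoint, done):
        super().__init__()
        self.do_checkpoint = do_checkpoint  # callable that performs one checkpoint
        self.done = done                    # Event the main thread sets to request exit
        self.started = threading.Event()    # set once the first checkpoint has begun
        self.completed = 0                  # number of checkpoints that fully returned
        self.error = None                   # first exception raised, if any

    def run(self):
        try:
            # The exit request is only honored between checkpoints, never mid-checkpoint.
            while not self.done.is_set():
                self.started.set()
                self.do_checkpoint()
                self.completed += 1
                self.done.wait(0.01)
        except Exception as e:
            # Keep the failure so the main thread can see it after join().
            self.error = e

def fake_checkpoint():
    # Placeholder for the real checkpoint call; it just takes some time.
    time.sleep(0.02)

done = threading.Event()
ckpt = CheckpointThread(fake_checkpoint, done)
ckpt.start()
ckpt.started.wait()  # analogous to waiting for the checkpoint to take its snapshot
done.set()           # tell the checkpoint thread to finish
ckpt.join()          # wait for the thread
```

After the join, the main thread can assert that ckpt.error is None and ckpt.completed is at least 1. If either check failed in the real test, that would point at the thread rather than at PALM.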
I'm going to explicitly disable/skip this test for disagg with a FIXME.
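As a rough illustration of the skip, the sketch below uses plain unittest. The RUNNING_WITH_DISAGG_HOOK flag and the class name are hypothetical; the real suite presumably detects the disagg hook through its own mechanism, which is not shown here.

```python
import unittest

# Hypothetical flag; the real suite would determine this from its hook configuration.
RUNNING_WITH_DISAGG_HOOK = True

class CheckpointSnapshotSketch(unittest.TestCase):
    def test_checkpoint_snapshot(self):
        if RUNNING_WITH_DISAGG_HOOK:
            # FIXME: table creation is not persistent across the simulated
            # crash when running with the disagg hook.
            self.skipTest("FIXME: fails on restart with the disagg hook")
        # The normal test body would run here for non-disagg runs.

# Run the single test and collect the result without printing a report.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CheckpointSnapshotSketch)
result = unittest.TestResult()
suite.run(result)
```

Skipping inside the test method, rather than removing the test, keeps it visible in the run's skip count so the FIXME is not forgotten.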