Type: Task
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Checkpoints, Metadata
Labels: None

Assigned Teams: Storage Engines - Persistence
Total Hours with Assigned Team: 2,088.594
Epic Link: Consider supporting local tables on disaggregated storage connections
Sprint: SE Persistence backlog
Story Points: None

If WiredTiger crashes during a checkpoint after updating the metadata file but before completing the checkpoint in PALI, then restarting WiredTiger and picking up the latest completed checkpoint leaves the local and shared tables at inconsistent timestamps.

This is how it happens:

1. Set the stable timestamp to T1.
2. Create a checkpoint.
3. Do some writes.
4. Set the stable timestamp to T2.
5. Create a checkpoint. This time, crash after the metadata table and the turtle file are updated to reflect stable timestamp T2, but before the checkpoint completes in PALI.
6. Restart WiredTiger (without lose_all_my_data), which will result in WiredTiger setting the stable timestamp to T2 from the metadata table.
7. Pick up the latest completed checkpoint from PALI, which results in picking up a checkpoint with timestamp T1 < T2.
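
The sequence above can be reproduced in a toy model. This is only an illustrative sketch, not WiredTiger code: the `Node` class, its fields, and the `crash_before_pali` flag are hypothetical names, and timestamps are modeled as plain integers (T1 = 1, T2 = 2).

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy model of one node; all names here are hypothetical."""
    stable_ts: int = 0          # in-memory stable timestamp
    local_metadata_ts: int = 0  # stable timestamp recorded in the metadata table / turtle file
    pali_completed_ts: int = 0  # stable timestamp of the latest checkpoint completed in PALI

    def set_stable(self, ts: int) -> None:
        self.stable_ts = ts

    def checkpoint(self, crash_before_pali: bool = False) -> bool:
        # The local metadata is updated first ...
        self.local_metadata_ts = self.stable_ts
        if crash_before_pali:
            # ... and a crash here leaves PALI at the previous checkpoint.
            return False
        self.pali_completed_ts = self.stable_ts
        return True

    def restart(self) -> tuple[int, int]:
        # On restart the stable timestamp comes from the local metadata,
        # while the shared tables come from the latest completed PALI checkpoint.
        self.stable_ts = self.local_metadata_ts
        return self.stable_ts, self.pali_completed_ts


def reproduce() -> tuple[int, int]:
    node = Node()
    node.set_stable(1)                       # T1
    node.checkpoint()
    node.set_stable(2)                       # T2
    node.checkpoint(crash_before_pali=True)  # crash between local update and PALI
    return node.restart()
```

In this model `reproduce()` returns a local stable timestamp that is ahead of the timestamp of the checkpoint picked up from PALI, which is the mismatch the steps describe.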

This results in a data inconsistency: the local tables reflect stable timestamp T2, while the shared tables reflect timestamp T1. The caller would probably want to set the stable timestamp to T1 to match the picked-up checkpoint, but that moves the stable timestamp backwards, which could be dangerous. Please refer to WT-16711 for more details.

To solve this problem, we need to make the checkpoint completion in PALI and on disk atomic.

One way to do this might be:

1. Before updating the turtle file, save the previous contents to WiredTiger.turtle.prev. Then update the turtle file as usual.
2. Complete the checkpoint in PALI.
3. Remove WiredTiger.turtle.prev.
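
A minimal sketch of the three steps above, using ordinary files. It is not how WiredTiger writes the turtle file internally: `checkpoint_with_prev` and the `complete_in_pali` callback are hypothetical names, the turtle contents are treated as an opaque string, and durability details such as syncing the directory are left out.

```python
import os
import shutil

TURTLE = "WiredTiger.turtle"
TURTLE_PREV = "WiredTiger.turtle.prev"


def checkpoint_with_prev(home, new_turtle_text, complete_in_pali):
    """Hypothetical helper: update the turtle file while keeping a .prev copy.

    complete_in_pali is a caller-supplied callable standing in for
    "complete the checkpoint in PALI"; it is an assumption of this sketch.
    """
    turtle = os.path.join(home, TURTLE)
    prev = os.path.join(home, TURTLE_PREV)

    # Step 1: save the previous contents, then update the turtle file.
    shutil.copyfile(turtle, prev)
    tmp = turtle + ".tmp"
    with open(tmp, "w") as f:
        f.write(new_turtle_text)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, turtle)  # atomic rename onto the turtle file

    # Step 2: complete the checkpoint in PALI. If the process dies here
    # (modeled below by an exception), WiredTiger.turtle.prev stays on disk.
    complete_in_pali()

    # Step 3: the checkpoint is complete everywhere; drop the saved copy.
    os.remove(prev)
```

If `complete_in_pali` never returns, the directory is left with both the new WiredTiger.turtle and the old contents in WiredTiger.turtle.prev, which is the state the recovery procedure has to handle.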

Then, during recovery:

1. If WiredTiger.turtle.prev is not found, proceed as usual.
2. Otherwise, fetch the metadata of the latest checkpoint from PALI and compare it with the metadata from WiredTiger.turtle.prev.
3. If the metadata matches WiredTiger.turtle.prev, then use WiredTiger.turtle.prev instead of WiredTiger.turtle. Otherwise, use WiredTiger.turtle as usual.
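
The recovery decision above could look roughly like the following. This is a sketch under simplifying assumptions: `select_turtle` is a hypothetical name, the PALI checkpoint metadata is passed in as a string, and "match" is modeled as plain string equality, whereas a real comparison would presumably look at specific checkpoint metadata fields.

```python
import os


def select_turtle(home, pali_checkpoint_metadata):
    """Return the path of the turtle file that recovery should use.

    pali_checkpoint_metadata is the metadata of the latest completed
    checkpoint fetched from PALI (assumed here to be a comparable string).
    """
    turtle = os.path.join(home, "WiredTiger.turtle")
    prev = os.path.join(home, "WiredTiger.turtle.prev")

    # No saved copy: the last checkpoint completed everywhere, proceed as usual.
    if not os.path.exists(prev):
        return turtle

    with open(prev) as f:
        prev_metadata = f.read()

    # PALI still holds the checkpoint described by the saved copy, so the
    # crash happened before the newer checkpoint completed in PALI.
    if pali_checkpoint_metadata == prev_metadata:
        return prev

    # Otherwise use the current turtle file as usual.
    return turtle
```

The function only chooses a file; removing or renaming WiredTiger.turtle.prev after the choice is left open here, as it is in the description.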

We can (and arguably should) skip this process if lose_all_my_data is true, because in that case we do not make any guarantees about data consistency anyway. Skipping it there will significantly lower the risk of proceeding with this work.

It is possible for the metadata in the checkpoint to be newer than WiredTiger.turtle. This happens if another node stepped up in the meantime and created a new checkpoint; dealing with this situation would be the application's responsibility. We should never be in a situation where the metadata of the checkpoint from PALI falls between WiredTiger.turtle.prev and WiredTiger.turtle, because that would mean that another node created a checkpoint while the original node was still the leader.
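
The cases in this paragraph can be summarized as a small classification. The sketch assumes each piece of metadata carries a totally ordered checkpoint position (for example a stable timestamp) represented as an integer; `classify` and the `Recovery` names are hypothetical. A PALI checkpoint older than WiredTiger.turtle.prev is not discussed here, so the sketch reports it separately instead of guessing.

```python
from enum import Enum


class Recovery(Enum):
    USE_TURTLE_PREV = "PALI matches WiredTiger.turtle.prev"
    USE_TURTLE = "PALI matches WiredTiger.turtle"
    NEWER_CHECKPOINT = "another node stepped up; application's responsibility"
    INVARIANT_VIOLATION = "checkpoint created while original node was still leader"
    NOT_COVERED = "PALI checkpoint older than WiredTiger.turtle.prev"


def classify(prev_order: int, turtle_order: int, pali_order: int) -> Recovery:
    """Place the PALI checkpoint relative to the two local turtle files."""
    if prev_order >= turtle_order:
        raise ValueError("WiredTiger.turtle.prev must be older than WiredTiger.turtle")
    if pali_order == prev_order:
        return Recovery.USE_TURTLE_PREV
    if pali_order == turtle_order:
        return Recovery.USE_TURTLE
    if pali_order > turtle_order:
        return Recovery.NEWER_CHECKPOINT
    if prev_order < pali_order < turtle_order:
        return Recovery.INVARIANT_VIOLATION
    return Recovery.NOT_COVERED
```

Treating the "between" case as a distinct result rather than silently picking a file makes it easy to assert on in crash tests.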

If we decide to implement this, we should roll back the test change in WT-16711. Rolling it back would provide us with a good and fairly reproducible test for this scenario.

Making this change will enable more robust crash testing, for example with timestamp_abort.

Is related to: WT-16711 Crash/Recovery timestamp_abort (disagg=leader) records absent in collections table (Closed)

Assignee: [DO NOT USE] Backlog - Storage Engines Team
Reporter: Peter Macko
Votes: 0
Watchers: 1

Created: May 06 2026 09:19:02 PM UTC
Updated: May 07 2026 06:27:44 AM UTC
