Make checkpoint completion on local disk and in PALI atomic


    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Checkpoints, Metadata

      If WiredTiger crashes during a checkpoint after updating the metadata file but before completing the checkpoint in PALI, then restarting WiredTiger and picking up the latest completed checkpoint leaves us with inconsistent timestamps.

      This is how it happens:

      1. Set the stable timestamp to T1
      2. Create a checkpoint
      3. Do some writes
      4. Set the stable timestamp to T2
      5. Create a checkpoint. This time, crash after the metadata table and the turtle file are updated to reflect stable timestamp T2, but before the checkpoint completes in PALI.
      6. Restart WiredTiger (without lose_all_my_data), which will result in WiredTiger setting the stable timestamp to T2 from the metadata table.
      7. Pick up the latest completed checkpoint from PALI, which results in picking up a checkpoint with timestamp T1 < T2.

      This results in a data inconsistency: the local tables reflect stable timestamp T2, while the shared tables reflect timestamp T1. The caller would probably want to set the stable timestamp to T1 to match the picked-up checkpoint, but that moves the stable timestamp backwards, which could be dangerous. Please refer to WT-16711 for more details.
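
      The sequence above can be reproduced in a toy model. The sketch below is purely illustrative: ToyNode, its fields, and the numeric timestamps are hypothetical stand-ins and do not correspond to any WiredTiger or PALI API. It only shows how the locally recorded stable timestamp and the last checkpoint completed in PALI can diverge after the crash in step 5.

```python
# Toy model only: none of these names are WiredTiger or PALI APIs.
class ToyNode:
    def __init__(self):
        self.stable_ts = None          # in-memory stable timestamp
        self.local_stable_ts = None    # stands in for the metadata table / turtle file
        self.pali_completed_ts = None  # stands in for the latest checkpoint completed in PALI

    def set_stable(self, ts):
        self.stable_ts = ts

    def checkpoint(self, crash_before_pali=False):
        # The local metadata is updated first...
        self.local_stable_ts = self.stable_ts
        if crash_before_pali:
            return False  # simulated crash: PALI never sees this checkpoint
        # ...and only afterwards is the checkpoint completed in PALI.
        self.pali_completed_ts = self.stable_ts
        return True


def simulate(t1=10, t2=20):
    node = ToyNode()
    node.set_stable(t1)                      # step 1
    node.checkpoint()                        # step 2
    node.set_stable(t2)                      # step 4 (step 3, the writes, is omitted)
    node.checkpoint(crash_before_pali=True)  # step 5
    # Steps 6 and 7: after restart, the two sources disagree.
    return node.local_stable_ts, node.pali_completed_ts
```

      With the default arguments, simulate() returns a local stable timestamp of T2 and a PALI checkpoint timestamp of T1, which is the T1 < T2 mismatch described in step 7.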

      To solve this problem, we need to make the checkpoint completion in PALI and on disk atomic.

      One way to do this might be:

      • Before updating the turtle file, save the previous contents to WiredTiger.turtle.prev. Then update the turtle file as usual.
      • Complete the checkpoint in PALI.
      • Remove WiredTiger.turtle.prev.
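
      A minimal sketch of these three steps, written in Python for brevity rather than as WiredTiger code. The function name, the temporary file name, and the complete_in_pali callback are assumptions made for illustration; the real turtle update path and the PALI completion call would take their place.

```python
import os
import shutil

TURTLE = "WiredTiger.turtle"
TURTLE_PREV = "WiredTiger.turtle.prev"


def checkpoint_with_prev(home, new_turtle_contents, complete_in_pali):
    """Hypothetical helper: update the turtle file while keeping a .prev copy.

    complete_in_pali is an assumed callback standing in for whatever
    completes the checkpoint in PALI; it is expected to raise on failure.
    """
    turtle = os.path.join(home, TURTLE)
    prev = os.path.join(home, TURTLE_PREV)
    tmp = turtle + ".tmp"  # hypothetical temporary name

    # Save the previous contents before touching the turtle file.
    # A real implementation would also have to make this copy durable.
    if os.path.exists(turtle):
        shutil.copyfile(turtle, prev)

    # Update the turtle file: write a temporary file, then rename it over.
    with open(tmp, "w") as f:
        f.write(new_turtle_contents)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, turtle)

    # Complete the checkpoint in PALI. If this raises (or the process dies
    # here), the .prev file is left behind for recovery to inspect.
    complete_in_pali(new_turtle_contents)

    # Only once PALI has the checkpoint is the .prev copy removed.
    if os.path.exists(prev):
        os.remove(prev)
```

      The ordering is the point of the sketch: WiredTiger.turtle.prev exists for exactly the window in which the local turtle file may be ahead of PALI.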

      Then, during recovery:

      • If WiredTiger.turtle.prev is not found, proceed as usual.
      • Otherwise, fetch the metadata of the latest checkpoint from PALI, and compare them to the metadata from WiredTiger.turtle.prev.
      • If the metadata match WiredTiger.turtle.prev, then use WiredTiger.turtle.prev instead of WiredTiger.turtle. Otherwise use WiredTiger.turtle as usual.
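
      The recovery rule above reduces to a small decision function. This is a sketch under the assumption that checkpoint metadata can be compared for equality; choose_turtle and its parameters are hypothetical names, not existing WiredTiger functions.

```python
def choose_turtle(prev_metadata, pali_metadata):
    """Return the name of the turtle file that recovery should use.

    prev_metadata: metadata read from WiredTiger.turtle.prev, or None if
                   that file does not exist.
    pali_metadata: metadata of the latest completed checkpoint in PALI.
    """
    if prev_metadata is None:
        # No .prev file: the last checkpoint completed cleanly.
        return "WiredTiger.turtle"
    if pali_metadata == prev_metadata:
        # PALI still holds the older checkpoint, so the newer local turtle
        # file describes a checkpoint that never completed in PALI.
        return "WiredTiger.turtle.prev"
    # Otherwise the checkpoint did complete in PALI before the crash.
    return "WiredTiger.turtle"
```

      For example, choose_turtle("ckpt@T1", "ckpt@T1") selects WiredTiger.turtle.prev, while choose_turtle(None, "ckpt@T2") and choose_turtle("ckpt@T1", "ckpt@T2") both select WiredTiger.turtle.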

      We can (and arguably should) skip this process if lose_all_my_data is true, since in that case we make no guarantees about data consistency anyway. This will significantly lower the risk of proceeding with this work.

      The metadata in the checkpoint may be newer than WiredTiger.turtle, which happens if another node stepped up in the meantime and created a new checkpoint; dealing with this situation would be the application's responsibility. We should never find the metadata of the checkpoint from PALI to be between WiredTiger.turtle.prev and WiredTiger.turtle, because that would mean that another node created a checkpoint while the original node was still the leader.
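
      These cases can be written out explicitly if we assume that checkpoint metadata carries some totally ordered value (for example, a checkpoint order or timestamp). That ordering, the Outcome names, and classify itself are assumptions for illustration only. The case where PALI is older than WiredTiger.turtle.prev is not discussed in this ticket, so the sketch rejects it rather than guessing.

```python
from enum import Enum


class Outcome(Enum):
    USE_PREV = "use WiredTiger.turtle.prev"
    USE_CURRENT = "use WiredTiger.turtle"
    NEWER_THAN_LOCAL = "another node checkpointed; application must handle"
    INVARIANT_VIOLATION = "checkpoint created while original node was leader"


def classify(prev_order, current_order, pali_order):
    """Classify the PALI checkpoint relative to the two local turtle files.

    All three arguments are hypothetical comparable values, with
    prev_order < current_order.
    """
    if pali_order == prev_order:
        return Outcome.USE_PREV
    if pali_order == current_order:
        return Outcome.USE_CURRENT
    if pali_order > current_order:
        return Outcome.NEWER_THAN_LOCAL
    if prev_order < pali_order < current_order:
        return Outcome.INVARIANT_VIOLATION
    # pali_order < prev_order: not covered by this ticket.
    raise ValueError("PALI checkpoint is older than WiredTiger.turtle.prev")
```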

      If we decide to implement this, we should roll back the test change in WT-16711. This would provide us with a good and fairly reproducible test for this scenario.

      Making this change will enable more robust crash testing, for example with timestamp_abort.

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Peter Macko