Shared history store verify failure in test/format due to missing checkpoint pickup


    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Test Format
    • Storage Engines - Foundations
    • 90.104

      Issue Summary

      A history store (HS) verify failure occurs during wiredtiger_open in follower mode when running the disagg test/format switch test. The failure is triggered because the shared history store is verified before any checkpoint has been picked up, which leaves the global timestamps and disaggregated storage values inconsistent (e.g., last_checkpoint_timestamp is WT_TS_NONE).

      Context

      • test/format does not delete local files on startup, so local tables and metadata from previous runs are retained.
      • The global timestamps in txn_global appear to be set from the local metadata table and turtle file, not from an actual checkpoint.
      • Despite not picking up a checkpoint, the shared history store is accessible due to retained local metadata.
      • The typical MongoDB sequence is to pick up a checkpoint when starting as a follower, but test/format does not do this automatically.
      • As a result, shared data can be accessed without a checkpoint, which is fundamentally incorrect.
      • Workarounds such as gating operations on last_checkpoint_meta_lsn != WT_DISAGG_LSN_NONE or passing checkpoint_meta config have been suggested, but the underlying issue remains.
      • The Slack thread suggests that the test should either pick up a checkpoint automatically in wiredtiger_open or gate/abort if a checkpoint has not been picked up.

      Proposed Solution

      • Modify wiredtiger_open in test/format to automatically pick up a checkpoint if local metadata is retained and no checkpoint has been picked up.
      • Alternatively, gate operations on the presence of a valid checkpoint (e.g., last_checkpoint_meta_lsn != WT_DISAGG_LSN_NONE) and abort/crash if this condition is not met to prevent accessing shared data without a checkpoint.
      • Review and update test/format startup sequence to ensure correct checkpoint handling and prevent inconsistent state.
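
      The two options above could be sketched roughly as follows. This is a minimal, self-contained illustration and not WiredTiger or test/format code: every sketch_* name, the state struct, and the have_checkpoint/checkpoint_lsn parameters are hypothetical, and SKETCH_DISAGG_LSN_NONE is a stand-in for WT_DISAGG_LSN_NONE whose value of 0 is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for WT_DISAGG_LSN_NONE; the value 0 is an assumption. */
#define SKETCH_DISAGG_LSN_NONE UINT64_C(0)

/* Hypothetical, simplified view of the state discussed in this ticket. */
typedef struct {
    bool follower;                     /* Connection was opened in follower mode. */
    uint64_t last_checkpoint_meta_lsn; /* Stays at "none" until a checkpoint is picked up. */
} sketch_disagg_state;

typedef enum { SKETCH_OK, SKETCH_PICKUP_NEEDED } sketch_gate_result;

/* Gate: a follower may touch shared data only after a checkpoint pickup. */
static sketch_gate_result
sketch_check_shared_access(const sketch_disagg_state *st)
{
    if (st->follower && st->last_checkpoint_meta_lsn == SKETCH_DISAGG_LSN_NONE)
        return (SKETCH_PICKUP_NEEDED);
    return (SKETCH_OK);
}

/* Hypothetical pickup: record the LSN of the checkpoint handed in by the caller. */
static void
sketch_pick_up_checkpoint(sketch_disagg_state *st, uint64_t checkpoint_lsn)
{
    assert(checkpoint_lsn != SKETCH_DISAGG_LSN_NONE);
    st->last_checkpoint_meta_lsn = checkpoint_lsn;
}

/*
 * Startup sequence: if the gate is closed, pick up a checkpoint automatically
 * when one is available; otherwise return -1 so the caller can abort before
 * any shared table (such as the shared history store) is verified.
 */
static int
sketch_startup(sketch_disagg_state *st, bool have_checkpoint, uint64_t checkpoint_lsn)
{
    if (sketch_check_shared_access(st) == SKETCH_OK)
        return (0);
    if (have_checkpoint) {
        sketch_pick_up_checkpoint(st, checkpoint_lsn);
        return (0);
    }
    return (-1);
}
```

      In this sketch, a follower that starts with retained local metadata but no picked-up checkpoint either completes the pickup before anything reads shared data, or gets -1 and can fail loudly instead of proceeding to the history store verify.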

      Original Slack thread
      This ticket was generated by AI from a Slack thread.

            Assignee: Unassigned
            Reporter: Memento Slack Bot
            Votes: 0
            Watchers: 4
