Panic on error during checkpoint in disagg


    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Checkpoints
    • Storage Engines - Persistence
    • SE Persistence backlog

      In disaggregated storage, a number of code paths during checkpoint are non-recoverable on error. In theory they could be made recoverable, but at this stage they are not, and an in-depth analysis to determine how to recover, and which other paths need recovery, is expensive.

      To reduce the likelihood of data corruption, we should panic for the time being.

      The first path of concern is the shared metadata queue drain, which on error will have mutated the original queue and has no path to restore it.
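To illustrate why a failed drain is hard to undo, here is a minimal standalone sketch. It is not WiredTiger code: `sketch_queue`, `sketch_publish`, `sketch_drain`, and `SKETCH_PANIC` are hypothetical names invented for this example, and the real queue layout and failure modes may differ. It only shows the general shape of the problem: entries are unlinked before they are published, so an error part-way leaves the queue shorter than it started.

```c
#include <stddef.h>

/* Hypothetical stand-in for a panic return code; not the real WT_PANIC value. */
#define SKETCH_PANIC (-1)

/* Hypothetical queue entry and queue; the real metadata queue may look different. */
struct sketch_entry {
    struct sketch_entry *next;
    int payload;
};

struct sketch_queue {
    struct sketch_entry *head;
    size_t count;
};

/* Hypothetical publish step: fails for negative payloads to simulate an error. */
static int
sketch_publish(const struct sketch_entry *e)
{
    return (e->payload < 0 ? -2 : 0);
}

/*
 * Drain the queue destructively. Each entry is unlinked before it is published, so
 * when publishing fails the queue has already lost that entry and every entry before
 * it. Nothing here can rebuild the original queue, so the error becomes a panic.
 */
static int
sketch_drain(struct sketch_queue *q)
{
    struct sketch_entry *e;

    while ((e = q->head) != NULL) {
        q->head = e->next;
        --q->count;
        if (sketch_publish(e) != 0)
            return (SKETCH_PANIC);
    }
    return (0);
}
```

In this sketch, a caller that retried after the failure would see only the entries that had not yet been unlinked, which is the kind of silent loss the panic is meant to prevent.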

      Some other paths I have concerns about are:

      1. _wt_disagg_put_checkpoint_meta calls _disagg_put_page. Can that be rolled back?
      2. When starting a new checkpoint, _disagg_begin_checkpoint is called, but there is no call to abandon_checkpoint if the checkpoint fails. Instead we set ckpt_success=false and call into _wt_disagg_advance_checkpoint, which will begin a subsequent checkpoint.
      3. plh_put is called from root page writes in __wti_block_disagg_write_internal. Can those be rolled back?

      Scope:

      • On any error during a disagg checkpoint, raise a WT_PANIC.
      • Add a test.
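One possible shape for the scoped change, as a rough sketch only: every name below (`sketch_conn`, `sketch_step_fn`, `sketch_checkpoint`, `SKETCH_PANIC`) is hypothetical and stands in for whatever the real checkpoint code uses. The assumption illustrated is that any non-zero return from a checkpoint step is converted into a panic, and that the panic is sticky so later checkpoints are refused rather than run against possibly inconsistent state.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a panic return code; not the real WT_PANIC value. */
#define SKETCH_PANIC (-1)

/* Hypothetical connection state: only tracks whether a panic has occurred. */
struct sketch_conn {
    bool panicked;
};

/* Hypothetical checkpoint step: returns 0 on success, non-zero on error. */
typedef int (*sketch_step_fn)(void);

static int
sketch_step_ok(void)
{
    return (0);
}

static int
sketch_step_fail(void)
{
    return (5); /* Arbitrary non-zero error for illustration. */
}

/*
 * Run the checkpoint steps in order. Any step error marks the connection as
 * panicked and returns the panic code; once panicked, no further steps run.
 */
static int
sketch_checkpoint(struct sketch_conn *conn, const sketch_step_fn *steps, size_t nsteps)
{
    size_t i;

    if (conn->panicked)
        return (SKETCH_PANIC);
    for (i = 0; i < nsteps; ++i)
        if (steps[i]() != 0) {
            conn->panicked = true;
            return (SKETCH_PANIC);
        }
    return (0);
}
```

A test along the lines the scope asks for could inject a failing step, check that the panic code comes back, and then check that a subsequent checkpoint attempt is also rejected.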

            Assignee:
            Sean Watt
            Reporter:
            Luke Pearson