Spike: Investigate crash recovery testing strategy for Disagg


    • Type: Task
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Not Applicable
    • Storage Engines - Foundations
    • SE Foundations - 2026-02-27
    • 5

      Background
      We currently have a set of Disagg tests that validate happy-path scenarios using test/format and Python-based test suites. While these tests provide confidence in stability and correctness under normal operation, they do not explicitly validate crash and recovery behaviour.

      In a Disagg context, crash recovery can have different implications than it does for ASC storage, and it is not yet clear what testing crash recovery means here or which existing tests could be extended to cover it.

      Before committing to implementing crash recovery tests, we need a clearer understanding of:

      • What scenarios we actually want to validate for crash recovery in Disagg
      • Which existing tests, if any, are good candidates to adapt for this purpose (timestamp_abort currently looks like the best candidate)
      • What gaps exist in our current test coverage with respect to crash recovery

      This spike is intended to investigate the problem and define a clear direction, not to deliver the tests themselves.

      Scope

      • Define what “crash recovery” means in the Disagg context.
      • Identify potential test candidates (timestamp_abort is the likely one) that could be extended to include crash/recovery testing for Disagg.

            Assignee:
            Sid Mahajan
            Reporter:
            Sid Mahajan
            Votes:
            0
            Watchers:
            1

              Created:
              Updated:
              Resolved: