Extend crash testing framework with a configurable background thread for randomized crash points


    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Storage Engines, Storage Engines - Foundations, Storage Engines - Persistence
    • SE Foundations - Q3+ Backlog
    • 13

      Currently, WiredTiger supports crash testing during checkpoints using the WT_SESSION.checkpoint.checkpoint_crash_point field. While this feature has been valuable for reproducing complex bugs involving crashes, it is limited in scope: it requires adding crash-specific code to trigger a crash at a specific point, and it only crashes between tables, never while a checkpoint is processing a table.

      To expand crash testing capabilities and improve the reproducibility of issues across various subsystems, I propose implementing a generic background thread designed for crash testing.

      Key Features:

      • Configurable activation:
        • A new connection configuration field would enable/disable the crash testing thread.
      • Function pointer for crash conditions:
        • The thread would be assigned a user-defined function pointer to determine appropriate crash conditions dynamically.
      • Subsystem-specific crash scenarios:
        • For crash testing in the checkpoint subsystem, the thread could periodically check conditions such as whether a checkpoint has started and, based on those, trigger a crash once a timer expires or another rule is satisfied.
        • Conditions could be time-based, event-driven, or a mix of both, depending on testing requirements.
      • Extensibility:
        • The implementation would lay the groundwork for crash scenarios across other subsystems without requiring intrusive code changes in each subsystem.
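
The features above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a proposed implementation: every name here (crash_thread, crash_predicate_fn, crash_thread_start, and so on) is hypothetical and does not exist in WiredTiger. The `enabled` field stands in for the proposed connection configuration toggle, and the crash action is pluggable so the sketch can be exercised without actually aborting the process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

/* All names below are hypothetical; none are part of the WiredTiger API. */
typedef bool (*crash_predicate_fn)(void *ctx); /* true means "crash now" */
typedef void (*crash_action_fn)(void *ctx);    /* NULL means abort() */

typedef struct {
    bool enabled;                    /* stand-in for the proposed config toggle */
    unsigned poll_interval_ms;       /* how often the predicate is evaluated */
    crash_predicate_fn should_crash; /* user-defined crash condition */
    crash_action_fn crash;           /* what "crashing" means; NULL aborts */
    void *ctx;                       /* passed to both callbacks */
    atomic_bool stop;
    pthread_t tid;
    bool started;
} crash_thread;

/* Thread body: poll the predicate until it fires or a stop is requested. */
static void *crash_thread_run(void *arg) {
    crash_thread *ct = arg;
    struct timespec ts = {(time_t)(ct->poll_interval_ms / 1000),
                          (long)(ct->poll_interval_ms % 1000) * 1000000L};

    assert(ct->should_crash != NULL);
    while (!atomic_load(&ct->stop)) {
        if (ct->should_crash(ct->ctx)) {
            if (ct->crash != NULL)
                ct->crash(ct->ctx);
            else
                abort();
            break;
        }
        nanosleep(&ts, NULL);
    }
    return NULL;
}

/* Start the thread only if enabled; returns 0 on success or when disabled. */
static int crash_thread_start(crash_thread *ct) {
    if (!ct->enabled || ct->should_crash == NULL)
        return 0;
    atomic_store(&ct->stop, false);
    int ret = pthread_create(&ct->tid, NULL, crash_thread_run, ct);
    if (ret == 0)
        ct->started = true;
    return ret;
}

/* Request a stop and wait for the thread to exit; no-op if never started. */
static void crash_thread_stop(crash_thread *ct) {
    if (!ct->started)
        return;
    atomic_store(&ct->stop, true);
    pthread_join(ct->tid, NULL);
    ct->started = false;
}

/* Example predicate and action: fire on the Nth poll, record the event. */
typedef struct {
    atomic_int polls;
    int fire_at;
    atomic_bool fired;
} poll_ctx;

static bool crash_on_nth_poll(void *ctx) {
    poll_ctx *p = ctx;
    return atomic_fetch_add(&p->polls, 1) + 1 >= p->fire_at;
}

static void record_crash(void *ctx) {
    atomic_store(&((poll_ctx *)ctx)->fired, true);
}
```

Keeping the condition behind a function pointer is what would let a new subsystem be targeted by supplying a different predicate rather than by adding crash code inside that subsystem. A real version would presumably leave the crash action as an abort and reserve the recording action for unit tests of the thread itself.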

      Benefits:

      • Increased coverage for crash testing, exposing edge cases across multiple subsystems.
      • Greater flexibility in designing crash scenarios.
      • Simplified implementation, reducing the need for subsystem-specific crash code.

      Example Use Case:
      In the checkpoint subsystem, the thread could detect when a checkpoint begins and schedule a crash at a random point within that process or based on specific stages/events during the checkpoint lifecycle.

      Next Steps:

      • Get this reviewed by the team for feasibility.
      • Define the connection configuration field to toggle the crash thread.
      • Implement the background thread and the mechanism for handling the function pointer for crash logic.
      • Evaluate initial subsystem targets (e.g., checkpoints) for crash testing expansion.

              Assignee:
              [DO NOT USE] Backlog - Storage Engines Team
              Reporter:
              Etienne Petrel
              Votes:
              0
              Watchers:
              3
