WiredTiger / WT-10824

Create a tool to automatically parse and categorize checksum mismatch failures

    • Type: Improvement
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Asparagus-StorEng - 2023-10-31

      We occasionally see checksum mismatch failures – both in our internal testing and in the field. These errors typically either stem from an issue in the underlying hardware/software (WT-10509) or are extremely rare and unreproducible (WT-10693).

      Since data integrity is extremely important, it would be useful to collect data about as many of these failures as possible to see if we can identify common patterns and/or causes (whether in WT or in the systems underneath us).

      The proposal here is to add further processing after detecting a checksum mismatch to provide more information about what went wrong:

      • Did we read data that looks like valid WT data?
      • Did we read data that is at a different location in the file?
      • Does retrying the read produce the correct data?
      • Does the data look like garbage?
      • Does the data look "almost" right – i.e., could it be correct except for one or two flipped bits?
      • Etc.
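
      A minimal sketch of how a few of these questions could be answered mechanically. Everything here is an assumption for illustration: the function names are hypothetical, the category labels are invented, and zlib.crc32 is only a stand-in for whatever checksum function the block was written with. The bit-flip search recomputes the checksum once per bit, so it is only practical for block-sized inputs.

```python
import zlib


def find_single_bit_flips(data: bytes, expected: int, checksum=zlib.crc32) -> list[int]:
    """Return every bit index whose flip makes checksum(data) equal expected.

    Hypothetical helper; bit index = byte_index * 8 + bit position (LSB = 0).
    """
    buf = bytearray(data)
    hits = []
    for bit in range(len(buf) * 8):
        byte_index, bit_in_byte = divmod(bit, 8)
        mask = 1 << bit_in_byte
        buf[byte_index] ^= mask              # flip one bit
        if checksum(bytes(buf)) == expected:
            hits.append(bit)
        buf[byte_index] ^= mask              # restore the original bit
    return hits


def classify_mismatch(data: bytes, expected: int, checksum=zlib.crc32) -> str:
    """Assign a coarse, made-up category to a block that failed verification."""
    if checksum(data) == expected:
        return "matches"                     # e.g. a retried read returned good data
    if len(set(data)) <= 1:
        return "constant-fill"               # all zeros, all 0xff, or empty
    if find_single_bit_flips(data, expected, checksum):
        return "single-bit-flip"             # "almost" right
    return "unknown"                         # needs further analysis
```

      Running the same classifier on the originally failing buffer and on a re-read of the same block would also cover the retry question: a "matches" result on the second read suggests a transient fault rather than bad data on disk.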

      We could do this with a tool that post-processes the failure data in conjunction with the underlying data file, perhaps by extending wt_binary_decode.py, or we could extend the WiredTiger library to perform this analysis.
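
      As a rough illustration of the post-processing option, the sketch below checks whether the bytes returned by the failing read also appear at some other aligned offset in the data file, which would point at a misdirected read or write. It does not use or extend wt_binary_decode.py; the function name is hypothetical and the 4096-byte alignment is an assumption that would need to match the file's actual allocation size.

```python
def find_block_elsewhere(path: str, block: bytes, align: int = 4096) -> list[int]:
    """Return aligned file offsets whose contents equal `block`.

    Hypothetical helper; `align` is an assumed allocation size.
    """
    if not block:
        return []                            # nothing meaningful to search for
    offsets = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            f.seek(offset)
            chunk = f.read(len(block))
            if len(chunk) < len(block):      # ran past the end of the file
                break
            if chunk == block:
                offsets.append(offset)
            offset += align
    return offsets
```

      If the only offset returned is the one the read was issued for, the data was not simply fetched from the wrong place; any other offset in the result is a candidate for where the data actually belongs.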

            Assignee:
            backlog-server-storage-engines [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            keith.smith@mongodb.com Keith Smith
            Votes:
            0
            Watchers:
            4

              Created:
              Updated:
              Resolved: