- Type: Improvement
- Resolution: Duplicate
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Asparagus-StorEng - 2023-10-31
We occasionally see checksum mismatch failures, both in our internal testing and in the field. These errors are typically either caused by an issue in the underlying hardware/software (WT-10509) or so rare that they cannot be reproduced (WT-10693).
Since data integrity is extremely important, it would be useful to collect data about as many of these failures as possible to see if we can identify common patterns and/or causes (whether in WT or in the systems underneath us).
The proposal here is to add further processing after detecting a checksum mismatch to provide more information about what went wrong:
- Did we read data that looks like valid WT data?
- Did we read data that is at a different location in the file?
- Does retrying the read produce the correct data?
- Does the data look like garbage?
- Does the data look "almost" right – i.e., could it be correct except for one or two flipped bits?
- Etc.
We could do this with a tool, perhaps by extending wt_binary_decode.py, that post-processes the failure data in conjunction with the underlying data file. Alternatively, we could extend the WiredTiger library to perform this analysis itself.
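As a rough illustration of what the post-processing could look like, here is a minimal Python sketch covering three of the questions above: is the block all zeroes, would flipping a single bit make the checksum match, and does a block with the expected checksum exist at some other aligned offset in the file. Everything in it is an assumption made for illustration: the function names are hypothetical, `zlib.crc32` is only a stand-in for whatever checksum WiredTiger actually computes, and the real block header layout (including how the stored checksum is excluded from the calculation) is not modelled.

```python
import zlib
from typing import Callable, Optional

# Stand-in checksum (assumption): WiredTiger's real checksum routine is not used here.
Checksum = Callable[[bytes], int]


def find_single_bit_flip(block: bytes, expected: int,
                         checksum: Checksum = zlib.crc32) -> Optional[int]:
    """Return the index of the one bit whose flip makes the checksum match, else None.

    Brute force: one checksum per bit, so only suitable for offline analysis.
    """
    buf = bytearray(block)
    for bit in range(len(buf) * 8):
        byte, mask = bit // 8, 1 << (bit % 8)
        buf[byte] ^= mask                 # flip the candidate bit
        if checksum(bytes(buf)) == expected:
            return bit
        buf[byte] ^= mask                 # restore it before trying the next one
    return None


def find_block_elsewhere(file_data: bytes, expected: int, block_size: int,
                         alignment: int = 4096,
                         checksum: Checksum = zlib.crc32) -> Optional[int]:
    """Return the first aligned offset whose block has the expected checksum, else None."""
    for offset in range(0, len(file_data) - block_size + 1, alignment):
        if checksum(file_data[offset:offset + block_size]) == expected:
            return offset
    return None


def classify_mismatch(block: bytes, expected: int,
                      checksum: Checksum = zlib.crc32) -> str:
    """Give a coarse label for a block that failed checksum verification."""
    if checksum(block) == expected:
        return "ok"                       # e.g. a retried read returned good data
    if block.count(0) == len(block):
        return "all-zero"
    bit = find_single_bit_flip(block, expected, checksum)
    if bit is not None:
        return f"single-bit-flip@{bit}"
    return "unknown"
```

A caller would pass the bytes that failed verification together with the checksum it expected, and could call `classify_mismatch` a second time on a re-read of the same offset to answer the "does retrying the read help" question. Detecting two flipped bits the same way is quadratic in the block size, so it would likely need a smarter approach than this brute-force loop.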
- duplicates: WT-11177 Create tooling to detect and analyse checksum errors (Open)