Type: Improvement
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- dev-prod
- supportability

Sprint:
Asparagus-StorEng - 2023-10-31
Story Points:
None

We occasionally see checksum mismatch failures – both in our internal testing and in the field. These errors are typically either caused by an issue in the underlying hardware/software (~~WT-10509~~) or so rare that they cannot be reproduced (~~WT-10693~~).

Since data integrity is extremely important, it would be useful to collect data about as many of these failures as possible to see if we can identify common patterns and/or causes (whether in WT or in the systems underneath us).

The proposal here is to add further processing after detecting a checksum mismatch to provide more information about what went wrong:

- Did we read data that looks like valid WT data?
- Did we read data that is at a different location in the file?
- Does retrying the read produce the correct data?
- Does the data look like garbage?
- Does the data look "almost" right – i.e., could it be correct except for one or two flipped bits?
- Etc.

We could do this with a tool that post-processes the failure data in conjunction with the underlying data file, perhaps by extending wt_binary_decode.py, or we could extend the WiredTiger library to perform this analysis.
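As a rough illustration of the offline-tool option, the sketch below covers three of the questions above: does a retry produce good data, does the data we read actually live at another offset in the file, and would a single flipped bit explain the mismatch. Everything here is hypothetical and is not part of wt_binary_decode.py: the function names, the 512-byte alignment, and the use of `zlib.crc32` as a stand-in for WiredTiger's real block checksum are all assumptions, and the expected checksum is assumed to be supplied separately from the block.

```python
import zlib
from typing import Callable, List, Optional

# Assumption: zlib.crc32 stands in for the real block checksum function.
Checksum = Callable[[bytes], int]


def find_single_bit_flip(block: bytes, expected: int,
                         checksum: Checksum = zlib.crc32) -> Optional[int]:
    """Return the index of the one bit whose flip makes the checksum match."""
    buf = bytearray(block)
    for bit in range(len(buf) * 8):
        index, shift = divmod(bit, 8)
        buf[index] ^= 1 << shift      # flip one bit
        if checksum(bytes(buf)) == expected:
            return bit
        buf[index] ^= 1 << shift      # restore it before trying the next bit
    return None


def find_elsewhere(block: bytes, image: bytes, expected_offset: int,
                   align: int = 512) -> Optional[int]:
    """Return another aligned file offset holding exactly the bytes we read."""
    for off in range(0, len(image) - len(block) + 1, align):
        if off != expected_offset and image[off:off + len(block)] == block:
            return off
    return None


def classify_mismatch(block: bytes, expected: int,
                      image: Optional[bytes] = None, offset: int = 0,
                      reread: Optional[bytes] = None, align: int = 512,
                      checksum: Checksum = zlib.crc32) -> List[str]:
    """Collect hints about why `block` failed its checksum (hypothetical)."""
    findings = []
    if reread is not None and checksum(reread) == expected:
        findings.append("retry-ok")
    if len(set(block)) <= 1:
        findings.append("constant-fill")   # e.g. all zeroes
    if image is not None:
        other = find_elsewhere(block, image, offset, align)
        if other is not None:
            findings.append(f"found-at-offset:{other}")
    bit = find_single_bit_flip(block, expected, checksum)
    if bit is not None:
        findings.append(f"single-bit-flip:{bit}")
    return findings
```

The bit-flip search recomputes the checksum once per bit, so its cost grows quadratically with block size. That is probably acceptable for post-processing a single failed block, but it is one reason to prefer a separate tool over running this inside the library's read path.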

duplicates

WT-11177 Create documentation on analysing checksum errors

In Code Review

Assignee: [DO NOT USE] Backlog - Storage Engines Team
Reporter: Keith Smith
Votes: 0
Watchers: 4

Created: Mar 27 2023 07:30:30 PM UTC
Updated: Jan 12 2024 08:01:24 PM UTC
Resolved: Oct 27 2023 08:15:38 PM UTC
