- Type: New Feature
- Resolution: Fixed
- Priority: Minor - P4
- Affects Version/s: None
- Component/s: None
- StorEng - Refinement Pipeline
Summary
Create a tool that we can use after a checksum mismatch to try to parse the incorrect data and, if it is recognizable, tell us what it is.
Motivation
Today when WiredTiger sees a checksum mismatch during a read, it prints a hex dump of the contents of the incorrect block and panics. The dump is almost always useless because the typical engineer has no way to figure out what that data is.
If we could easily determine what data is in a "corrupt" block, it might help in diagnosing the underlying problem.
- If the block is unrecognizable garbage, it would rule out an error in higher-level WT code and point the finger at something going wrong in the OS or storage system below WiredTiger.
- If the block contains recognizable WiredTiger data, it could be useful in debugging. How did that block get written to the wrong place? Or why didn't this location get overwritten with the correct data?
Suggested Solution
We have code in salvage that will walk through a file trying to find recognizable WiredTiger blocks. We could build on that so that after a checksum mismatch we take a section of the file around the mismatch and use the salvage code to find and print information about any recognizable blocks that overlap with the failed read.
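
As a rough illustration of the idea, here is a minimal Python sketch of scanning a window of the file around a failed read and reporting any self-consistent blocks that overlap it. It does not use WiredTiger's real on-disk format or the actual salvage code. The header layout (`HEADER`), the `MAGIC` marker, the `ALLOC_SIZE` alignment, and the helper names `make_block` and `find_overlapping_blocks` are all assumptions made up for this example.

```python
import struct
import zlib
from typing import List, NamedTuple

# Everything below is a toy format invented for illustration only.
ALLOC_SIZE = 512                  # assumed alignment unit for block starts
HEADER = struct.Struct("<4sII")   # hypothetical header: magic, total size, CRC32
MAGIC = b"BLK0"                   # hypothetical marker, not a WiredTiger constant


class Candidate(NamedTuple):
    offset: int  # file offset where the recognizable block starts
    size: int    # total block size in bytes, header included


def make_block(payload: bytes) -> bytes:
    """Build a toy block, zero-padded to a multiple of ALLOC_SIZE."""
    raw = HEADER.size + len(payload)
    size = -(-raw // ALLOC_SIZE) * ALLOC_SIZE   # round up to the alignment unit
    body = payload + b"\0" * (size - raw)
    return HEADER.pack(MAGIC, size, zlib.crc32(body)) + body


def find_overlapping_blocks(data: bytes, base: int,
                            read_off: int, read_len: int) -> List[Candidate]:
    """Scan `data` (a window of the file starting at file offset `base`)
    for self-consistent blocks that overlap the failed read
    [read_off, read_off + read_len)."""
    found = []
    read_end = read_off + read_len
    start = (-base) % ALLOC_SIZE   # first aligned file offset inside the window
    for pos in range(start, len(data) - HEADER.size + 1, ALLOC_SIZE):
        magic, size, crc = HEADER.unpack_from(data, pos)
        # Reject anything whose header is implausible.
        if magic != MAGIC or size < HEADER.size or size % ALLOC_SIZE:
            continue
        if pos + size > len(data):
            continue   # block would run past the window we were given
        # A block is "recognizable" only if its own checksum verifies.
        if zlib.crc32(data[pos + HEADER.size:pos + size]) != crc:
            continue
        off = base + pos
        if off < read_end and off + size > read_off:
            found.append(Candidate(off, size))
    return found
```

In this sketch, a caller would read some margin of the file on either side of the failed read, pass that window in as `data` with its starting file offset as `base`, and print whatever candidates come back. An empty result corresponds to the "unrecognizable garbage" case above; a non-empty result gives the offset and size of each valid-looking block that landed in or across the failed read's range.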