- Type: Sub-task
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Tools
- Security Level: Public (Available to anyone on the web)
- Storage Engines - Persistence
- 1,113.112
- SE Persistence backlog
The wt verify subcommand accepts a read_corrupt option to read past corrupt pages. We should extend this to dump and other subcommands that are useful for diagnostic purposes.
Background
wt verify accepts a -c flag that passes read_corrupt=true to session->verify(). This controls two behaviours in the verify path:
- Btree traversal continuation (bt_vrfy.c): when __wt_page_in fails on a child ref, verify logs the error and skips the subtree instead of aborting.
- Block I/O suppression: the verify path sets WT_BTREE_VERIFY on the btree, which causes block_io.c and related block readers to return an error code rather than panicking on checksum/decryption failures.
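The traversal-continuation behaviour can be illustrated with a small, self-contained sketch. It is not the code in bt_vrfy.c: the sketch_page type, sketch_page_in(), and sketch_verify_tree() are hypothetical stand-ins, and the real __wt_page_in has a different signature. It only shows the shape of "log the failed child, skip its subtree, keep walking".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical in-memory page: a node with child references. */
struct sketch_page {
    bool corrupt;                  /* simulates a failed page read */
    size_t nchildren;
    struct sketch_page **children;
};

/* Stand-in for reading a child page in; non-zero means the read failed. */
static int
sketch_page_in(const struct sketch_page *page)
{
    return (page->corrupt ? -1 : 0);
}

/*
 * Walk the tree rooted at page. In read-corrupt mode a child that cannot be
 * read is logged and its whole subtree is skipped; otherwise the first
 * failure aborts the walk.
 */
static int
sketch_verify_tree(const struct sketch_page *page, bool read_corrupt,
    size_t *verified, size_t *skipped)
{
    size_t i;
    int ret;

    ++*verified;
    for (i = 0; i < page->nchildren; ++i) {
        const struct sketch_page *child = page->children[i];

        if ((ret = sketch_page_in(child)) != 0) {
            if (!read_corrupt)
                return (ret);
            fprintf(stderr, "child %zu unreadable (%d), skipping subtree\n", i, ret);
            ++*skipped;
            continue;
        }
        if ((ret = sketch_verify_tree(child, read_corrupt, verified, skipped)) != 0)
            return (ret);
    }
    return (0);
}
```

In this sketch a skipped subtree does not change the return value; whether a real command should still report failure at the end is a separate decision.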
Other diagnostic commands (dump, read, stat, list, page, printlog) have no equivalent. When any of these encounters a corrupt page, the cursor iteration fails and the command aborts. This makes the wt CLI less useful for inspecting partially corrupt or disaggregated databases.
Goal
All read-oriented wt subcommands should be able to continue past corrupt pages, printing or skipping what they can recover. Subcommands in scope: dump, read, stat, list, page, printlog.
Implementation Considerations
There are two independent layers to address:
Layer 1 — Block I/O: WT_SESSION_QUIET_CORRUPT_FILE is the session flag checked in block_read.c, block_io.c, and block_disagg_read.c to suppress panics and return an error code instead. Setting this flag on the session before running a diagnostic command prevents crashes on corrupt block reads.
Layer 2 — Cursor iteration: Even with Layer 1 in place, a corrupt page causes cursor->next() / cursor->prev() to return an error, which current iteration loops (e.g. dump_all_records in util_dump.c) treat as fatal. Each command's iteration loop needs to distinguish between WT_NOTFOUND (end of data), WT_ERROR/EIO in read-corrupt mode (skip and continue), and other errors (abort). This mirrors the pattern in bt_vrfy.c where verify's traversal catches __wt_page_in errors and continues.
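A minimal sketch of the three-way classification described above, written against a mock cursor so that it compiles on its own. The mock_cursor type, mock_next(), scan_all(), and the SKETCH_ constants are assumptions for illustration; real code would include wiredtiger.h and call cursor->next() on a WT_CURSOR.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for WT_ERROR and WT_NOTFOUND; real code takes them from wiredtiger.h. */
#define SKETCH_WT_ERROR (-31802)
#define SKETCH_WT_NOTFOUND (-31803)

/* Hypothetical mock cursor that replays a scripted list of next() results. */
struct mock_cursor {
    const int *results;
    size_t count;
    size_t pos;
};

static int
mock_next(struct mock_cursor *cursor)
{
    if (cursor->pos >= cursor->count)
        return (SKETCH_WT_NOTFOUND); /* end of data */
    return (cursor->results[cursor->pos++]);
}

struct scan_stats {
    size_t records; /* records successfully read */
    size_t skipped; /* errors skipped in read-corrupt mode */
};

/*
 * Iterate to the end of the data. Returns 0 for a clean scan, the first
 * skipped error if read-corrupt mode had to skip anything (so the caller can
 * exit non-zero), or the error itself as soon as a fatal one is seen.
 */
static int
scan_all(struct mock_cursor *cursor, bool read_corrupt, struct scan_stats *stats)
{
    int first_err = 0, ret;

    stats->records = stats->skipped = 0;
    for (;;) {
        ret = mock_next(cursor);
        if (ret == 0) {
            ++stats->records; /* a real command would print the record here */
            continue;
        }
        if (ret == SKETCH_WT_NOTFOUND)
            break;
        if (read_corrupt && (ret == SKETCH_WT_ERROR || ret == EIO)) {
            ++stats->skipped;
            if (first_err == 0)
                first_err = ret;
            continue;
        }
        return (ret); /* any other error, or not in read-corrupt mode */
    }
    return (first_err);
}
```

The sketch assumes the cursor can simply be advanced again after a failed call. How a real cursor is repositioned past the corrupt page after an error is not covered here and would need to be settled per command.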
Implementation Options
Option A — Global CLI flag
Add a flag to util_main.c, using a letter that is not already taken at the global level. After the session is opened, if the flag is set, call F_SET((WT_SESSION_IMPL *)session, WT_SESSION_QUIET_CORRUPT_FILE) before dispatching to the subcommand. Each targeted subcommand still needs its iteration loop updated to continue past errors (Layer 2).
- Pro: single point of change for the session flag; automatically applies to any current or future subcommand.
- Con: -c is already a per-command flag in wt verify, so a different letter must be chosen at the global level. Does not remove the need to update each command's iteration loop.
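As a rough illustration of Option A, the sketch below sets a flag bit on a mock session once, before dispatch, so every subcommand sees it. Everything here is hypothetical: mock_session, run_cli(), and sketch_dump() are invented for the example, SKETCH_QUIET_CORRUPT_FILE stands in for WT_SESSION_QUIET_CORRUPT_FILE, and "-X" is only a placeholder letter, not an existing or proposed wt option. The real util_main.c uses its own option parsing and F_SET.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for WT_SESSION_QUIET_CORRUPT_FILE. */
#define SKETCH_QUIET_CORRUPT_FILE 0x1u

/* Hypothetical session: only the flags word matters for this sketch. */
struct mock_session {
    uint32_t flags;
};

typedef int (*subcommand_fn)(struct mock_session *, int, char *[]);

struct subcommand {
    const char *name;
    subcommand_fn fn;
};

/* Example subcommand: returns 1 if it is running in quiet-corrupt mode. */
static int
sketch_dump(struct mock_session *session, int argc, char *argv[])
{
    (void)argc;
    (void)argv;
    return ((session->flags & SKETCH_QUIET_CORRUPT_FILE) != 0 ? 1 : 0);
}

/*
 * Parse global options (those before the subcommand name), set the session
 * flag once, then dispatch. Returns -1 for an unknown option or subcommand.
 */
static int
run_cli(struct mock_session *session, const struct subcommand *cmds, size_t ncmds,
    int argc, char *argv[])
{
    size_t j;
    int i;

    for (i = 1; i < argc && argv[i][0] == '-'; ++i) {
        if (strcmp(argv[i], "-X") == 0) /* placeholder letter */
            session->flags |= SKETCH_QUIET_CORRUPT_FILE;
        else
            return (-1);
    }
    if (i >= argc)
        return (-1);
    for (j = 0; j < ncmds; ++j)
        if (strcmp(cmds[j].name, argv[i]) == 0)
            return (cmds[j].fn(session, argc - i, argv + i));
    return (-1);
}
```

The subcommand itself never parses the global option; it only inspects the session, which is why this option applies to future subcommands without further flag handling.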
Option B — Per-subcommand flag (matches verify's existing -c pattern)
Add -c to each targeted subcommand. Each command sets WT_SESSION_QUIET_CORRUPT_FILE on the session and updates its iteration loop to continue past errors.
- Pro: consistent with wt verify -c; each command is self-contained.
- Con: changes required in at least six files (util_dump.c, util_read.c, util_stat.c, util_list.c, util_page.c, util_printlog.c).
Test Plan
- Write a Python suite test that creates a table, corrupts a page on disk, then checks that wt dump -c, wt read -c, and wt stat -c each produce partial output and exit non-zero rather than crashing or producing no output.
- Confirm wt verify -c behaviour is unchanged.
- Confirm that without -c, commands still abort on the first corrupt page.