Type: Task
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: Checkpoints
Storage Engines - Persistence
SE Persistence backlog
Issue Summary
When a checkpoint fails in __wt_block_checkpoint_resolve, the system triggers WT_PANIC with the hardcoded error code EINVAL. As a result, the logs contain only the generic message "the checkpoint failed, the system must restart." Because neither the underlying error nor its context is recorded, diagnosing the root cause of the failure is difficult.
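To make the information loss concrete, here is a minimal C sketch of the current behavior. It is not WiredTiger code: struct current_panic, resolve_checkpoint_current, and the assumption that the resolve step learns only *whether* the checkpoint failed (not why) are all hypothetical. Whatever error the caller actually hit, the report always carries EINVAL and the same fixed message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Fixed text logged today, quoted from the ticket. */
static const char CKPT_FAILED_MSG[] =
    "the checkpoint failed, the system must restart";

/* Hypothetical summary of what the panic carries into the log. */
struct current_panic {
    int code;            /* error code attached to the panic (0 if none) */
    const char *message; /* text that reaches the log, or NULL if no panic */
};

/*
 * Hypothetical model of the current path. ASSUMPTION: the resolve step is
 * told only that the checkpoint failed, so the real cause cannot be reported
 * and EINVAL is used unconditionally.
 */
static void
resolve_checkpoint_current(int failed, struct current_panic *out)
{
    assert(out != NULL);
    out->code = 0;
    out->message = NULL;
    if (failed) {
        out->code = EINVAL; /* hardcoded, whatever actually went wrong */
        out->message = CKPT_FAILED_MSG;
    }
}
```

Two different underlying failures (say, a full disk and an I/O error) produce identical reports, which is exactly what made the HELP ticket hard to troubleshoot.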
Context
- The checkpoint failure path currently logs neither the actual error nor any surrounding context, only the generic EINVAL code.
- This was observed in a HELP ticket, where insufficient log detail hindered troubleshooting.
- The discussion noted that in cases such as WT_CKPT_INPROGRESS, errors are recoverable and already handled, whereas in the WT_CKPT_PANIC_ON_FAILURE case, knowing the actual ret value and additional context would be valuable.
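The distinction drawn in the discussion can be sketched as a small decision function. The enum values only mirror the state names above and, like classify_failure, are hypothetical rather than WiredTiger definitions; the point is that only the panic branch needs the richer context this ticket asks for.

```c
#include <assert.h>

/* Hypothetical stand-ins for the two checkpoint states named above. */
enum ckpt_state { CKPT_INPROGRESS, CKPT_PANIC_ON_FAILURE };

/* What the resolve step does with the outcome of a checkpoint. */
enum resolve_action {
    RESOLVE_OK,          /* checkpoint succeeded, nothing to report */
    RESOLVE_RECOVERABLE, /* return the error; the caller handles it */
    RESOLVE_PANIC        /* unrecoverable; the system must restart */
};

/* failure_ret is 0 on success, otherwise the error the checkpoint hit. */
static enum resolve_action
classify_failure(enum ckpt_state state, int failure_ret)
{
    if (failure_ret == 0)
        return RESOLVE_OK;
    if (state == CKPT_INPROGRESS)
        return RESOLVE_RECOVERABLE;
    assert(state == CKPT_PANIC_ON_FAILURE);
    return RESOLVE_PANIC; /* only this branch needs ret and context logged */
}
```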
Proposed Solution
- Enhance logging in __wt_block_checkpoint_resolve to include:
  - The actual error code returned (the ret value)
  - Additional context about where and how the checkpoint failed
- Ensure that when WT_PANIC is triggered, the logs provide actionable information for debugging.
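A minimal sketch of the proposed logging, under stated assumptions: the caller passes the original ret value and a short description of the failing step down to the resolve step. resolve_checkpoint_proposed, struct proposed_panic, and the message layout are hypothetical, not WiredTiger API; the sketch only shows the message keeping the block name, the step, strerror(ret), and the numeric ret.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical summary of what the panic carries into the log. */
struct proposed_panic {
    int code;          /* error code attached to the panic (0 if none) */
    char message[256]; /* text that reaches the log */
};

/*
 * Hypothetical sketch of the proposed path. ASSUMPTIONS (not WiredTiger API):
 * the caller supplies the original error code and the step that failed.
 */
static void
resolve_checkpoint_proposed(int failure_ret, const char *block_name,
                            const char *stage, struct proposed_panic *out)
{
    assert(out != NULL && block_name != NULL);
    out->code = 0;
    out->message[0] = '\0';
    if (failure_ret == 0)
        return;
    /* Forward the real error code instead of replacing it with EINVAL. */
    out->code = failure_ret;
    snprintf(out->message, sizeof(out->message),
             "%s: the checkpoint failed during %s: %s (ret=%d), "
             "the system must restart",
             block_name, stage != NULL ? stage : "an unspecified step",
             strerror(failure_ret), failure_ret);
}
```

For example, calling it with ENOSPC, the illustrative block name "file:example.wt", and the illustrative step "extent list write" yields a message naming all three plus the numeric ret. Whether WT_PANIC itself should carry ret, or keep EINVAL while only the message gains detail, is a separate design choice; this sketch forwards ret.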
Original Slack thread
This ticket was generated by AI from a Slack thread.