Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Checkpoints
Labels:
- code-quality

Assigned Teams:

Storage Engines, Storage Engines - Persistence
Sprint:
StorEng - Refinement Pipeline
Story Points:
None

WiredTiger checkpoints are generally expected not to fail past a certain point. Before that point of no return, the application can ignore a checkpoint failure and continue. In practice, most checkpoint errors are fatal: WiredTiger is expected either to panic, or to roll back and return a panic to the application.

In WT-10989 I made several changes to the checkpoint's block manager. When a checkpoint is configured to also switch the underlying files and flush the previous files to the next tier, failures are hard to handle. If we have switched the underlying file and the checkpoint then fails at a later stage, ideally we should switch back to the previous file. But what if the newer file has already started receiving writes in the meantime?

Although the existing checkpoint code is already expected to treat errors at this stage as non-recoverable, I made sure the system panics if an error occurs and the checkpoint is configured to flush the files.

I am filing this ticket so the team can discuss the existing checkpoint failure handling and brainstorm whether we can handle tier flush errors gracefully, or whether it is safer to just panic, as proposed by WT-10989.
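
To give the discussion something concrete to point at, here is a minimal sketch of the policy described above. It is not WiredTiger code: the type `ckpt_state`, the enum `ckpt_result`, and the function `ckpt_classify_error` are hypothetical names invented for illustration, and applying the flush rule to the whole checkpoint (rather than only to the block manager stage) is a simplifying assumption.

```c
#include <stdbool.h>

/* Hypothetical outcome codes for this sketch; not WiredTiger error codes. */
typedef enum {
    CKPT_OK,                 /* no error occurred */
    CKPT_FAILED_RECOVERABLE, /* the application may ignore the failure and continue */
    CKPT_PANIC               /* the error is fatal; the system should panic */
} ckpt_result;

/* Hypothetical checkpoint state; the field names are assumptions. */
typedef struct {
    bool past_point_of_no_return; /* checkpoint has passed the point of no return */
    bool flush_tier;              /* checkpoint also switches files and flushes to the next tier */
} ckpt_state;

/*
 * Classify a checkpoint error under the policy described in this ticket.
 * A non-zero err means some checkpoint step failed.
 */
ckpt_result
ckpt_classify_error(const ckpt_state *state, int err)
{
    if (err == 0)
        return CKPT_OK;

    /*
     * Assumption: when a tier flush is configured, the underlying file may
     * already have been switched and the newer file may have received
     * writes, so switching back is unsafe. Panic, as WT-10989 proposes.
     */
    if (state->flush_tier)
        return CKPT_PANIC;

    /* Past the point of no return, the error is treated as non-recoverable. */
    if (state->past_point_of_no_return)
        return CKPT_PANIC;

    /* Before the point of no return, the failure can be ignored. */
    return CKPT_FAILED_RECOVERABLE;
}
```

Graceful handling of tier flush errors would amount to replacing the `flush_tier` branch with a rollback path, which is only safe if the newer file is known to have received no writes.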

related to

WT-10989 Implement precise coordination of checkpoint and flush for tiered tables

Closed

Assignee: [DO NOT USE] Backlog - Storage Engines Team
Reporter: Sulabh Mahajan
Votes: 0
Watchers: 5

Created: Nov 28 2023 07:48:22 AM UTC
Updated: Mar 21 2025 12:28:46 AM UTC