Type: Task
Resolution: Fixed
Priority: Major - P3
Fix Version/s: WT12.0.0, 9.0.0-rc0
Affects Version/s: None
Component/s: Checkpoints
Security Level: Public (Available to anyone on the web)
Labels:
- lc_bulk_04_29_26

Assigned Teams:

Storage Engines - Persistence
Total Hours with Assigned Team:
2,679.036
Sprint:
SE Persistence backlog
Story Points:
None

In disaggregated storage, a number of checkpoint code paths are non-recoverable on error. In theory they could be made recoverable, but at this stage they are not, and an in-depth analysis to determine how to recover, and which other paths need recovery, is expensive.

To reduce the risk of data corruption, we should panic for the time being.

The first path of concern is the shared metadata queue drain, which on error will have mutated the original queue and has no path to restore it.
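
The queue-drain concern can be illustrated with a minimal C sketch. Everything here is hypothetical: `sketch_queue`, `sketch_process_entry`, and `sketch_drain` are illustrative names, not WiredTiger structures or functions, and the real shared metadata queue may be organized differently. The sketch only shows why a drain that removes entries before processing them cannot be undone by returning an error.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a shared metadata queue; not WiredTiger's actual structure. */
#define SKETCH_QUEUE_MAX 8

typedef struct {
    int entries[SKETCH_QUEUE_MAX];
    size_t head;  /* index of the next entry to drain */
    size_t count; /* number of entries still queued */
} sketch_queue;

/* Hypothetical per-entry work: a negative entry simulates a failure. */
static int
sketch_process_entry(int entry)
{
    return (entry < 0 ? -1 : 0);
}

/*
 * Drain the queue one entry at a time. Each entry is removed before it is
 * processed, so a failure partway through leaves the queue shorter than it
 * started and nothing here puts the removed entries back.
 */
static int
sketch_drain(sketch_queue *q)
{
    int entry, ret;

    assert(q != NULL);
    while (q->count > 0) {
        entry = q->entries[q->head];
        q->head++;
        q->count--;
        if ((ret = sketch_process_entry(entry)) != 0)
            return (ret); /* the queue has already been mutated */
    }
    return (0);
}
```

With a queue holding `1, 2, -3, 4`, the drain fails on the third entry and leaves only one entry queued. A caller that simply retried would never see the first three entries again, which is the kind of state a panic is meant to guard against.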

Some other paths I have concerns about are:

- _wt_disagg_put_checkpoint_meta calls _disagg_put_page. Can that be rolled back?
- When starting a new checkpoint, _disagg_begin_checkpoint is called, but there is no call to abandon_checkpoint if the checkpoint fails. Instead we set ckpt_success=false and call into _wt_disagg_advance_checkpoint, which begins a subsequent checkpoint.
- plh_put is called from root page writes in __wti_block_disagg_write_internal. Can those be rolled back?

Scope:

- On any error in disagg checkpoint throw a WT_PANIC.
- Add a test.
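
One possible shape for the panic-on-error scope is sketched below in C, under stated assumptions. `sketch_conn`, `sketch_ckpt_step`, `sketch_step_ok`, and `sketch_step_fail` are hypothetical names, and `SKETCH_PANIC` is an arbitrary value standing in for `WT_PANIC`. The ticket does not say how the change was implemented, so this is an illustration of the idea rather than the actual fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Arbitrary value standing in for WT_PANIC so the sketch compiles on its own. */
#define SKETCH_PANIC (-1000)

/* Hypothetical connection state: once panicked, nothing else may proceed. */
typedef struct {
    bool panicked;
} sketch_conn;

/* Hypothetical signature for one step of a checkpoint. */
typedef int (*sketch_step_fn)(void *arg);

/* Example steps used to exercise the wrapper. */
static int
sketch_step_ok(void *arg)
{
    (void)arg;
    return (0);
}

static int
sketch_step_fail(void *arg)
{
    (void)arg;
    return (-1);
}

/*
 * Run one checkpoint step. Any non-zero return is escalated to a panic
 * instead of being handed back for the caller to attempt recovery, and the
 * panic is sticky: later steps fail fast without running.
 */
static int
sketch_ckpt_step(sketch_conn *conn, sketch_step_fn step, void *arg)
{
    assert(conn != NULL && step != NULL);

    if (conn->panicked)
        return (SKETCH_PANIC);
    if (step(arg) == 0)
        return (0);

    conn->panicked = true;
    return (SKETCH_PANIC);
}
```

A test along the lines of the second scope item could inject a failing step and check that the call returns the panic code and that every later step is refused, which is what calling `sketch_ckpt_step` with `sketch_step_fail` followed by `sketch_step_ok` exercises here.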

Related to:

- WT-18008 [Verify] race-condition-stress-asan-test-3 timeout: possible stall in new HS verify logic (__verify_key_hs) under ASAN (In Progress)
- WT-16711 Crash/Recovery timestamp_abort (disagg=leader) records absent in collections table (Closed)

Assignee: Sean Watt
Reporter: Luke Pearson
Votes: 0
Watchers: 4

Created: Mar 20 2026 08:50:39 PM UTC
Updated: Jul 07 2026 06:38:40 AM UTC
Resolved: May 08 2026 12:33:15 AM UTC
