__layered_update_gc_ingest_tables_prune_timestamps WT_ASSERT(session, prune_timestamp >= btree->prune_timestamp) failing while running hello_with_standby.js


    • Storage Engines, Storage Engines - Foundations
    • SE Foundations - 2025-08-15
    • 5

      Initially this problem was reported by jonathan.reams@mongodb.com on his feature branch, which was mostly changing encryption code and is probably unrelated.

      While trying to reproduce the initial failure, I successfully reproduced it on the `mongod-disagg-integration` branch without any additional changes.

      The issue happens during checkpoint pickup on the follower, for a table that goes through the pickup process for the first time. When opening the ingest table, we assign the timestamp of the last system-wide checkpoint as the `btree->prune_timestamp` value; but while picking up the checkpoint, we set the local stable table's checkpoint timestamp as the new `prune_timestamp`, which can be older than the one set initially.

      The following code snippet briefly explains what's happening from the implementation perspective:

      __schema_open_layered_ingest(...) {
          /* Set `disaggregated_storage.last_checkpoint_timestamp` as `btree->prune_timestamp` during initialization. */
          WT_ACQUIRE_READ(ingest_btree->prune_timestamp, S2C(session)->disaggregated_storage.last_checkpoint_timestamp);
      }

      __layered_update_gc_ingest_tables_prune_timestamps(...) {
          /* Obtain the last checkpoint number from the stable table metadata. */
          __layered_last_checkpoint_order(session, layered_table->stable_uri, &last_ckpt);

          /* `ckpt_inuse` is always <= `last_ckpt` under the current logic. */
          if (ckpt_inuse != layered_table->last_ckpt_inuse) {
              /* Failing assertion!
               * On the first pass, `btree->prune_timestamp` == `disaggregated_storage.last_checkpoint_timestamp`,
               * while `prune_timestamp` <= the current stable table's `last_ckpt` timestamp. */
              WT_ASSERT(session, prune_timestamp >= btree->prune_timestamp);
              /* Timestamp update. */
              WT_RELEASE_WRITE(btree->prune_timestamp, prune_timestamp);
          }
      }
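      The ordering violation above can be reduced to a minimal standalone sketch. This is not WiredTiger code: the functions, the numeric timestamps, and the error return are all illustrative, standing in for the two assignments described in the snippet.

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>

      typedef uint64_t wt_timestamp_t;

      /* Stand-in for __schema_open_layered_ingest: seed the prune timestamp
       * from the last system-wide checkpoint timestamp. */
      static wt_timestamp_t
      open_ingest_prune_timestamp(wt_timestamp_t last_global_ckpt_ts)
      {
          return last_global_ckpt_ts;
      }

      /* Stand-in for __layered_update_gc_ingest_tables_prune_timestamps:
       * the assertion requires the new timestamp to be non-decreasing. */
      static int
      update_prune_timestamp(wt_timestamp_t *btree_prune_ts, wt_timestamp_t stable_table_ckpt_ts)
      {
          if (stable_table_ckpt_ts < *btree_prune_ts)
              return -1; /* here WT_ASSERT would fire */
          *btree_prune_ts = stable_table_ckpt_ts;
          return 0;
      }

      int
      main(void)
      {
          /* Global checkpoint timestamp 100, but the table's own stable
           * checkpoint timestamp is an older 90: the first pickup for this
           * table violates monotonicity, mirroring the WT_ASSERT failure. */
          wt_timestamp_t prune_ts = open_ingest_prune_timestamp(100);
          assert(update_prune_timestamp(&prune_ts, 90) == -1);

          /* A stable checkpoint at or beyond the seed value would succeed. */
          assert(update_prune_timestamp(&prune_ts, 110) == 0);
          assert(prune_ts == 110);
          printf("ok\n");
          return 0;
      }
      ```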

      After discussing this with peter.macko@mongodb.com, we concluded there may be a bigger problem: it seems we cannot rely on checkpoint numbers when pruning, nor on the global checkpoint order `conn->disaggregated_storage->ckpt_track[0].ckpt_order`, since checkpoint order is not global and differs for every table. This requires further discussion.
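      Why per-table checkpoint orders are not comparable to a global counter can be shown with a small sketch. This is a hypothetical model, not WiredTiger code: it simply assumes each table numbers its checkpoints locally starting from its creation, so the same system-wide checkpoint maps to different order numbers in different tables.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Hypothetical per-table state: each table counts its own checkpoints
       * from 1, starting at the system checkpoint when it was created. */
      typedef struct {
          uint64_t first_system_ckpt; /* system checkpoint at table creation */
      } table_ckpt_state;

      /* Map a system-wide checkpoint to this table's local order number. */
      static uint64_t
      table_order_for_system_ckpt(const table_ckpt_state *t, uint64_t system_ckpt)
      {
          return system_ckpt - t->first_system_ckpt + 1;
      }

      int
      main(void)
      {
          table_ckpt_state old_table = {.first_system_ckpt = 1};
          table_ckpt_state new_table = {.first_system_ckpt = 9};

          /* For the same system checkpoint 10, the two tables disagree on
           * local order, so comparing either against a single global
           * counter (or against each other) is meaningless. */
          assert(table_order_for_system_ckpt(&old_table, 10) == 10);
          assert(table_order_for_system_ckpt(&new_table, 10) == 2);
          return 0;
      }
      ```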

      It might also be suboptimal to store all the checkpoints in an array and iterate through them for every table during each checkpoint pickup.

              Assignee: [DO NOT USE] Backlog - Storage Engines Team
              Reporter: Ivan Kochin
              Votes: 0
              Watchers: 7
