- Type: Bug
- Resolution: Fixed
- Priority: Critical - P2
- Affects Version/s: None
- Component/s: Layered Tables
- None
- Storage Engines, Storage Engines - Foundations
- SE Foundations - 2025-08-15
- 5
This problem was initially reported by jonathan.reams@mongodb.com on his feature branch, which mostly changed encryption code and is probably unrelated.
While trying to reproduce the initial failure, I successfully reproduced it on the `mongod-disagg-integration` branch without any additional changes.
The issue happens during checkpoint pickup on the follower, for a table that goes through the pickup process for the first time. When the ingest table is opened, we assign the timestamp of the last system-wide checkpoint as the `btree->prune_timestamp` value, but while picking up the checkpoint we set the local stable table checkpoint timestamp as the new `prune_timestamp`, which can be older than the one that was set initially.
The following code snippet briefly explains what's happening from the implementation perspective:
__schema_open_layered_ingest(...)
{
    // Set `disaggregated_storage.last_checkpoint_timestamp` value as
    // `btree->prune_timestamp` during initialization
    WT_ACQUIRE_READ(ingest_btree->prune_timestamp,
      S2C(session)->disaggregated_storage.last_checkpoint_timestamp);
}

__layered_update_gc_ingest_tables_prune_timestamps(...)
{
    // Obtain last checkpoint number from the stable table metadata
    __layered_last_checkpoint_order(session, layered_table->stable_uri, &last_ckpt);

    // `ckpt_inuse` is always <= `last_ckpt` according to the current logic
    if (ckpt_inuse != layered_table->last_ckpt_inuse) {
        // Failing assertion !!!
        // `btree->prune_timestamp` == `disaggregated_storage.last_checkpoint_timestamp`
        // on the first pass
        // `prune_timestamp` <= current stable table `last_ckpt` timestamp
        WT_ASSERT(session, prune_timestamp >= btree->prune_timestamp);

        // Timestamp update
        WT_RELEASE_WRITE(btree->prune_timestamp, prune_timestamp);
    }
}
After discussing this with peter.macko@mongodb.com, we came to the conclusion that there might be a bigger problem: it seems we cannot rely on checkpoint numbers when pruning, or on the global checkpoint order `conn->disaggregated_storage->ckpt_track[0].ckpt_order`, since checkpoint order is not global and differs for every table. That requires further discussion.
It might also be suboptimal to store all the checkpoints in an array and iterate through them for every table during each checkpoint pickup.
- is related to
  - WT-15188 Disable disagg hook for live restore python tests (Closed)
  - WT-15167 Improve the usage of __wt_page_block_meta (Closed)
- related to
  - WT-15192 Incorrect comparison between local table and metadata checkpoint orders during pruning (Open)
  - WT-15191 Write a regression test for WT-15158 (Open)
  - WT-15017 precise checkpoint config should not be in the configure for checkpoint server and reconfigurable (Closed)
  - WT-15096 Keep our compatibility testing up-to-date with the latest branches (Closed)