__layered_update_gc_ingest_tables_prune_timestamps WT_ASSERT(session, prune_timestamp >= btree->prune_timestamp) failing while running hello_with_standby.js


    • Storage Engines, Storage Engines - Foundations
    • SE Foundations - 2025-08-15
    • 5

      Initially this problem was reported by jonathan.reams@mongodb.com on his feature branch, which was mostly changing encryption code and is probably unrelated.

      While trying to reproduce the initial failure, I successfully reproduced it on the `mongod-disagg-integration` branch without any additional changes.

      The issue happens during checkpoint pickup on the follower, for a table that goes through the pickup process for the first time. When opening the ingest table, we assign the timestamp of the last system-wide checkpoint as the `btree->prune_timestamp` value; but while picking up the checkpoint, we set the local stable table's checkpoint timestamp as the new `prune_timestamp`, which can be older than the one set initially.

      The following code snippet briefly explains what's happening from the implementation perspective:

      __schema_open_layered_ingest(...) {
          /* Set `disaggregated_storage.last_checkpoint_timestamp` as `btree->prune_timestamp` during initialization. */
          WT_ACQUIRE_READ(ingest_btree->prune_timestamp, S2C(session)->disaggregated_storage.last_checkpoint_timestamp);
      }

      __layered_update_gc_ingest_tables_prune_timestamps(...) {
          /* Obtain the last checkpoint number from the stable table metadata. */
          __layered_last_checkpoint_order(session, layered_table->stable_uri, &last_ckpt);

          /* `ckpt_inuse` is always <= `last_ckpt` under the current logic. */
          if (ckpt_inuse != layered_table->last_ckpt_inuse) {
              /* Failing assertion!
               * On the first pass, `btree->prune_timestamp` == `disaggregated_storage.last_checkpoint_timestamp`,
               * while `prune_timestamp` <= the current stable table's `last_ckpt` timestamp. */
              WT_ASSERT(session, prune_timestamp >= btree->prune_timestamp);
              /* Timestamp update. */
              WT_RELEASE_WRITE(btree->prune_timestamp, prune_timestamp);
          }
      }
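      The ordering violation above can be reduced to a minimal standalone sketch. This is not WiredTiger code: the functions, the numeric timestamps, and the error return are all illustrative, standing in for the two assignments described in the snippet.

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>

      typedef uint64_t wt_timestamp_t;

      /* Stand-in for __schema_open_layered_ingest: seed the prune timestamp
       * from the last system-wide checkpoint timestamp. */
      static wt_timestamp_t
      open_ingest_prune_timestamp(wt_timestamp_t last_global_ckpt_ts)
      {
          return last_global_ckpt_ts;
      }

      /* Stand-in for __layered_update_gc_ingest_tables_prune_timestamps:
       * the assertion requires the new timestamp to be non-decreasing. */
      static int
      update_prune_timestamp(wt_timestamp_t *btree_prune_ts, wt_timestamp_t stable_table_ckpt_ts)
      {
          if (stable_table_ckpt_ts < *btree_prune_ts)
              return -1; /* here WT_ASSERT would fire */
          *btree_prune_ts = stable_table_ckpt_ts;
          return 0;
      }

      int
      main(void)
      {
          /* Global checkpoint timestamp 100, but the table's own stable
           * checkpoint timestamp is an older 90: the first pickup for this
           * table violates monotonicity, mirroring the WT_ASSERT failure. */
          wt_timestamp_t prune_ts = open_ingest_prune_timestamp(100);
          assert(update_prune_timestamp(&prune_ts, 90) == -1);

          /* A stable checkpoint at or beyond the seed value would succeed. */
          assert(update_prune_timestamp(&prune_ts, 110) == 0);
          assert(prune_ts == 110);
          printf("ok\n");
          return 0;
      }
      ```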

      After discussing this with peter.macko@mongodb.com, we concluded there may be a bigger problem: it seems we cannot rely on checkpoint numbers when pruning, nor on the global checkpoint order `conn->disaggregated_storage->ckpt_track[0].ckpt_order`, since checkpoint order is not global and differs for every table. This requires further discussion.
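      Why per-table checkpoint orders are not comparable to a global counter can be shown with a small sketch. This is a hypothetical model, not WiredTiger code: it simply assumes each table numbers its checkpoints locally starting from its creation, so the same system-wide checkpoint maps to different order numbers in different tables.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Hypothetical per-table state: each table counts its own checkpoints
       * from 1, starting at the system checkpoint when it was created. */
      typedef struct {
          uint64_t first_system_ckpt; /* system checkpoint at table creation */
      } table_ckpt_state;

      /* Map a system-wide checkpoint to this table's local order number. */
      static uint64_t
      table_order_for_system_ckpt(const table_ckpt_state *t, uint64_t system_ckpt)
      {
          return system_ckpt - t->first_system_ckpt + 1;
      }

      int
      main(void)
      {
          table_ckpt_state old_table = {.first_system_ckpt = 1};
          table_ckpt_state new_table = {.first_system_ckpt = 9};

          /* For the same system checkpoint 10, the two tables disagree on
           * local order, so comparing either against a single global
           * counter (or against each other) is meaningless. */
          assert(table_order_for_system_ckpt(&old_table, 10) == 10);
          assert(table_order_for_system_ckpt(&new_table, 10) == 2);
          return 0;
      }
      ```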

      It might also be suboptimal to store all the checkpoints in an array and iterate through them for every table during each checkpoint pickup.

              Assignee: [DO NOT USE] Backlog - Storage Engines Team
              Reporter: Ivan Kochin
              Votes: 0
              Watchers: 7
