No-timestamp tombstones in ingest btree causing assertion failure


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical - P2
    • Affects Version/s: None
    • Component/s: Layered Tables, Truncate
    • Storage Engines - Foundations
    • 299.107

      __layered_clear_ingest_table writes transactional tombstones with no timestamp, which eviction cannot safely reconcile on the ingest btree. This causes an assertion failure in __rec_fill_tw_from_upd_select (vpack == NULL), seen in AF-17117 and WT-17354.

      Problem

      During follower-to-leader step-up, __layered_clear_ingest_table is called after __layered_copy_ingest_table has drained ingest content into the stable table. The clear function uses a transactional truncate() to wipe the ingest btree:

      /*
       * __layered_clear_ingest_table --
       *     After ingest content has been drained to the stable table, clear out the ingest table.
       */
      static int
      __layered_clear_ingest_table(WT_SESSION_IMPL *session, const char *uri)
      {
          WT_ASSERT(session, WT_URI_IS_INGEST(uri));
      
          /*
           * Truncate needs a running txn. We should probably do something more like the history store and
           * make this non-transactional -- this happens during step-up, so we know there are no other
           * transactions running, so it's safe.
           */
          WT_RET(__wt_txn_begin(session, NULL));
      
          /*
           * No other transactions are running, we're only doing this truncate, and it should become
           * immediately visible. So this transaction doesn't have to care about timestamps.
           */
          F_SET(session->txn, WT_TXN_TS_NOT_SET);
      
          WT_RET(session->iface.truncate(&session->iface, uri, NULL, NULL, NULL));
      
          WT_RET(__wt_txn_commit(session, NULL));
      
          return (0);
      }
      

      This produces, for every key in the ingest btree, a WT_UPDATE_TOMBSTONE that carries a transaction ID but no timestamp.

      This is confirmed by the verbose log output from the crash:

      int __rec_fill_tw_from_upd_select(WT_SESSION_IMPL *, WT_PAGE *, WT_CELL_UNPACK_KV *, WTI_UPDATE_SELECT *, _Bool, WTI_RECONCILE *, WT_UPDATE *):1486:WiredTiger assertion failed: '(vpack != ((void*)0) && vpack->type != (4 << 4))'. No on-disk value is found
      
      update[0]: type=TOMBSTONE txnid=1549215 start_ts=(0, 0) durable_ts=(0, 0) prepare_ts=(0, 0) prepared_id=0 prepare_state=0 flags=0x0

      Step-up does not quiesce I/O: eviction and checkpointing can still write to disk in parallel. Eviction threads can therefore reconcile pages of the ingest btree while the clear is running, before the tombstones are globally visible. When that happens, __rec_fill_tw_from_upd_select() is called for a key that has no on-disk backing cell (vpack == NULL), and the assertion above fires.
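
      The failing case can be illustrated with a small stand-alone model. This is a sketch, not WiredTiger's reconciliation code: every type and function name below (model_upd, model_rec_outcome_for, and so on) is hypothetical, and the decision order is an assumption inferred from the assertion message and the logged update chain. It shows that a tombstone which is not yet globally visible needs a value to pair with, and that with no older update and no on-disk cell there is nothing to use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are not WiredTiger's real types. */
typedef enum { MODEL_UPD_STANDARD, MODEL_UPD_TOMBSTONE } model_upd_type;

typedef struct model_upd {
    model_upd_type type;
    struct model_upd *next; /* Older update on the same key, or NULL. */
} model_upd;

typedef enum {
    MODEL_REC_WRITE_CHAIN_VALUE,  /* A value from the update chain is written. */
    MODEL_REC_WRITE_ONDISK_VALUE, /* The existing on-disk cell supplies the value. */
    MODEL_REC_SKIP_KEY,           /* The key is deleted for everyone; write nothing. */
    MODEL_REC_NO_VALUE_FOUND      /* The case the ticket's assertion guards against. */
} model_rec_outcome;

/*
 * model_rec_outcome_for --
 *     Assumed decision order for the update selected by reconciliation.
 */
static model_rec_outcome
model_rec_outcome_for(const model_upd *selected, bool globally_visible, bool has_ondisk_cell)
{
    const model_upd *upd;

    assert(selected != NULL);

    if (selected->type != MODEL_UPD_TOMBSTONE)
        return (MODEL_REC_WRITE_CHAIN_VALUE);

    /* A tombstone everyone can see means the key can simply be dropped. */
    if (globally_visible)
        return (MODEL_REC_SKIP_KEY);

    /* Otherwise the tombstone must be written alongside the value it deletes. */
    for (upd = selected->next; upd != NULL; upd = upd->next)
        if (upd->type == MODEL_UPD_STANDARD)
            return (MODEL_REC_WRITE_CHAIN_VALUE);

    return (has_ondisk_cell ? MODEL_REC_WRITE_ONDISK_VALUE : MODEL_REC_NO_VALUE_FOUND);
}
```

      In this model, the logged chain (a single tombstone with txnid=1549215, no older update, no on-disk cell) lands in MODEL_REC_NO_VALUE_FOUND whenever the tombstone is not globally visible, and in MODEL_REC_SKIP_KEY once it is.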

      Here is an evergreen patch with verbose logging of the issue. Full logs are also attached to the ticket.

      Proposed Fix

      Bypass the transaction entirely so tombstones have no txnid and are immediately globally visible to all eviction threads. Writing tombstones with txnid = WT_TXN_NONE makes __wt_txn_upd_visible_all() return true unconditionally, regardless of any concurrent reader's oldest ID.
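
      The effect of the txnid change can be sketched with a toy visibility predicate. This is an illustration under stated assumptions, not the body of __wt_txn_upd_visible_all(): the names model_upd_visible_all and MODEL_TXN_NONE are hypothetical, MODEL_TXN_NONE is assumed to be 0, and timestamps are left out because the tombstones in question carry none.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Plays the role of WT_TXN_NONE in this model; the value 0 is an assumption. */
#define MODEL_TXN_NONE UINT64_C(0)

/*
 * model_upd_visible_all --
 *     Toy global-visibility check: an update with no transaction ID is visible to everyone;
 *     otherwise its ID must be older than the oldest ID any running transaction still needs.
 */
static bool
model_upd_visible_all(uint64_t upd_txnid, uint64_t oldest_id)
{
    if (upd_txnid == MODEL_TXN_NONE)
        return (true);
    return (upd_txnid < oldest_id);
}

/*
 * model_visibility_demo --
 *     Walk through the ticket's scenario using the txnid from the crash log.
 */
void
model_visibility_demo(void)
{
    const uint64_t tombstone_txnid = UINT64_C(1549215);

    /* Transactional tombstone: not globally visible until the oldest ID moves past it. */
    assert(!model_upd_visible_all(tombstone_txnid, tombstone_txnid));
    assert(model_upd_visible_all(tombstone_txnid, tombstone_txnid + 1));

    /* Non-transactional tombstone: globally visible whatever the oldest ID is. */
    assert(model_upd_visible_all(MODEL_TXN_NONE, 1));
    assert(model_upd_visible_all(MODEL_TXN_NONE, tombstone_txnid));
}
```

      Under this model the window in which eviction can see a tombstone that is not yet globally visible disappears, because the result no longer depends on the oldest ID.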

      Alternatively, drop the ingest btree after draining and recreate it empty instead of truncating in place. That leaves no tombstones for eviction to reconcile.

       

            Assignee:
            Sid Mahajan
            Reporter:
            Alana Huang
            Votes:
            0
            Watchers:
            4

              Created:
              Updated: