Fix reconciliation leaking overflow pages

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • None
    • Storage Engines, Storage Engines - Transactions
    • SE Transactions - 2025-11-07
    • 5
    • v8.2, v8.0, v7.0

      Reconciliation of pages containing overflow keys can leak overflow pages when a page split fails during a bulk insert.

      Error Signature

      This can be detected by running verify on the table. Verify reports that some checkpoint ranges were never verified: the address ranges of the leaked overflow pages remain in the allocated extent list, but nothing in the tree references them.

      WT_SESSION.verify: [WT_VERB_DEFAULT][ERROR]: __verify_ckptfrag_chk, 526: checkpoint ranges never verified: 1
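
To illustrate why a leaked page shows up as "never verified", here is a minimal toy model, not WiredTiger's verify code. It assumes verify starts from the set of allocated blocks, clears each block it reaches while walking the tree, and reports whatever is left. All names (`toy_ckpt`, `toy_verify_unverified`) are hypothetical, and each block stands in for one address range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 8

/* Toy checkpoint: which blocks the allocator believes are in use. */
struct toy_ckpt {
    bool allocated[NBLOCKS];
};

/*
 * Copy the allocated set, clear every block the tree walk reached, and count
 * what remains. Anything left over is allocated but unreferenced.
 */
size_t
toy_verify_unverified(const struct toy_ckpt *ckpt, const int *reachable, size_t nreachable)
{
    bool pending[NBLOCKS];
    size_t i, leftover = 0;

    for (i = 0; i < NBLOCKS; i++)
        pending[i] = ckpt->allocated[i];
    for (i = 0; i < nreachable; i++)
        if (reachable[i] >= 0 && reachable[i] < NBLOCKS)
            pending[reachable[i]] = false;
    for (i = 0; i < NBLOCKS; i++)
        if (pending[i])
            leftover++;
    return leftover;
}
```

With blocks 1, 2 and 5 allocated and only 1 and 2 reachable from the tree, the function returns 1, which corresponds to the single unverified range in the message above.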
      

      Steps

      1. A bulk cursor inserts k/v pairs, writing an overflow page for any key that becomes an overflow item
      2. The page size reaches the split ratio
      3. Reconciliation begins to split the page
      4. The page split returns EBUSY (this can happen for a multitude of reasons)
      5. The overflow page is orphaned, as the error path in the page split does not account for any overflow pages already written
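
The steps above can be sketched with a toy block manager. This is an illustrative model only, not WiredTiger code: `toy_blkmgr`, `toy_alloc`, `toy_split_write` and `toy_bulk_insert_ovfl` are hypothetical names, and the `fail_split` flag stands in for whatever condition makes the real split return EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NBLOCKS 8

/* Toy block manager: one flag per block, true while the block is allocated. */
struct toy_blkmgr {
    bool allocated[NBLOCKS];
};

/* Allocate the first free block and return its index, or -1 if none is free. */
int
toy_alloc(struct toy_blkmgr *bm)
{
    for (int i = 0; i < NBLOCKS; i++)
        if (!bm->allocated[i]) {
            bm->allocated[i] = true;
            return i;
        }
    return -1;
}

/* Count the blocks currently marked as allocated. */
int
toy_count_allocated(const struct toy_blkmgr *bm)
{
    int n = 0;

    for (int i = 0; i < NBLOCKS; i++)
        if (bm->allocated[i])
            n++;
    return n;
}

/* Stand-in for the split write: either fails with EBUSY or writes a leaf block. */
int
toy_split_write(struct toy_blkmgr *bm, bool fail_split)
{
    if (fail_split)
        return EBUSY;
    return toy_alloc(bm) < 0 ? ENOSPC : 0;
}

/* Write the overflow block first, then attempt the split. */
int
toy_bulk_insert_ovfl(struct toy_blkmgr *bm, bool fail_split)
{
    int ret;

    if (toy_alloc(bm) < 0) /* step 1: overflow block written */
        return ENOSPC;
    if ((ret = toy_split_write(bm, fail_split)) != 0) /* steps 3-4 */
        return ret; /* step 5: the overflow block is never released */
    return 0;
}
```

Calling `toy_bulk_insert_ovfl` with `fail_split` set returns EBUSY and leaves one block allocated that nothing refers to, which is the orphaned overflow page in this model.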

      Stack

      __curbulk_insert_row()
            │
            ▼
      __wt_bulk_insert_row()
      ↳  Internal calls: 
      →  __rec_cell_build_leaf_key()
      →  __wti_rec_cell_build_ovfl()
      →   __rec_write()
            │
            ▼
      __wti_rec_split_crossing_bnd()
            │
            ▼
      __wti_rec_split()
            │
            ▼
      __rec_split_write()
      ↳  Result: returns EBUSY
      

      Reproducer

      test_ovfl01.py follows the steps above, and wt-15739.diff introduces a failpoint that returns EBUSY from __rec_split_write.
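
For readers unfamiliar with failpoints, the general idea can be sketched as follows. This is not the content of wt-15739.diff, which is not reproduced here. It is a hypothetical counter-based failpoint that forces EBUSY on a chosen call, and every name in it is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical failpoint state: fail the Nth call (counting from 1); 0 disables it. */
struct toy_failpoint {
    unsigned fail_on_call;
    unsigned calls;
};

/* Return EBUSY on the configured call, 0 on every other call. */
int
toy_failpoint_check(struct toy_failpoint *fp)
{
    fp->calls++;
    return (fp->fail_on_call != 0 && fp->calls == fp->fail_on_call) ? EBUSY : 0;
}

/* Stand-in for a split write that consults the failpoint before doing any work. */
int
toy_split_write(struct toy_failpoint *fp)
{
    int ret;

    if ((ret = toy_failpoint_check(fp)) != 0)
        return ret;
    /* The real write would happen here. */
    return 0;
}
```

Failing a specific call rather than every call lets a test get past the first successful writes, including the overflow page, before the split is made to fail.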

      Discussion

      The failpoint in the diff is a bit crude and doesn't narrow down what could actually cause an EBUSY at this point. I suspect the checkpoint check just below could be one cause; however, I've not yet been able to hit it in testing.

          if (!last_block && __wt_btree_syncing_by_other_session(session)) {
              WT_STAT_CONN_DSRC_INCR(
                session, cache_eviction_blocked_multi_block_reconciliation_during_checkpoint);
              return (__wt_set_return(session, EBUSY));
          }
      

      The bulk load path does not use the overflow page tracking logic.

              /*
               * Track the overflow record (unless it's a bulk load, which by definition won't ever reuse
               * a record.
               */
              if (!r->is_bulk_load)
                  WT_ERR(__wti_ovfl_reuse_add(session, page, addr, size, kv->buf.data, kv->buf.size));
      

      On the normal reconciliation path, the error unwinds through the following stack (innermost frame first):

      __ovfl_reuse_wrapup_err
      __wti_ovfl_track_wrapup_err
      __rec_write_err
      __reconcile -> (ret = EBUSY)
      __wt_reconcile
      __evict_reconcile
      __wt_evict
      __evict_page
      __evict_lru_pages
      __evict_pass
      __evict_server
      __evict_thread_run
      __thread_run
      

      On this path, __ovfl_reuse_wrapup_err cleans up the newly written overflow pages. The bulk load path skips the tracking, so nothing cleans them up there.
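
One possible direction, sketched under assumptions and not a proposed patch: have the bulk path remember the overflow blocks it has written and release them if the split fails, similar in spirit to what the tracked path does on error. The toy types and functions below (`toy_ovfl_track`, `toy_track_wrapup_err`, `toy_bulk_insert_tracked`) are hypothetical and do not correspond to WiredTiger APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NBLOCKS 8
#define MAX_TRACKED 4

/* Toy block manager: one flag per block, true while the block is allocated. */
struct toy_blkmgr {
    bool allocated[NBLOCKS];
};

/* Overflow blocks written so far by the current bulk operation. */
struct toy_ovfl_track {
    int blocks[MAX_TRACKED];
    int count;
};

/* Allocate the first free block and return its index, or -1 if none is free. */
int
toy_alloc(struct toy_blkmgr *bm)
{
    for (int i = 0; i < NBLOCKS; i++)
        if (!bm->allocated[i]) {
            bm->allocated[i] = true;
            return i;
        }
    return -1;
}

/* Count the blocks currently marked as allocated. */
int
toy_count_allocated(const struct toy_blkmgr *bm)
{
    int n = 0;

    for (int i = 0; i < NBLOCKS; i++)
        if (bm->allocated[i])
            n++;
    return n;
}

/* Error-path cleanup: release every tracked overflow block. */
void
toy_track_wrapup_err(struct toy_blkmgr *bm, struct toy_ovfl_track *track)
{
    while (track->count > 0)
        bm->allocated[track->blocks[--track->count]] = false;
}

/* Write an overflow block, track it, and free it again if the split fails. */
int
toy_bulk_insert_tracked(struct toy_blkmgr *bm, bool fail_split)
{
    struct toy_ovfl_track track = {{0}, 0};
    int ovfl_block, ret;

    if ((ovfl_block = toy_alloc(bm)) < 0)
        return ENOSPC;
    track.blocks[track.count++] = ovfl_block;

    ret = fail_split ? EBUSY : 0; /* stand-in for the split write result */
    if (ret != 0)
        toy_track_wrapup_err(bm, &track);
    return ret;
}
```

In this model a failed split returns EBUSY with no blocks left allocated, while a successful one keeps the overflow block. Whether the real fix should reuse the existing tracking logic or add a separate bulk-specific cleanup is left open here.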

        1. test_ovfl01.py
          3 kB
        2. wt-15739.diff
          21 kB

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Sean Watt