bytes_total increment not protected by reconciliation panic boundary


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Checkpoints, Reconciliation
    • Storage Engines, Storage Engines - Persistence, Storage Engines - Transactions
    • SE Persistence backlog

      In the disaggregated block write path, bytes_total is incremented inside __wti_block_disagg_write_internal (block_disagg_write.c:216) immediately after plh_put succeeds. However, the caller, __wti_block_disagg_write, still has work to do after that point: it calls __wti_block_disagg_addr_pack, which can fail (e.g., the WT_ASSERT_ALWAYS at block_disagg_addr.c:105-106 that checks cookie->lsn > cookie->base_lsn).

      If address packing fails, the error propagates up through reconciliation and is handled by the non-panic error path in __rec_write (rec_write.c:372-382). This path calls __rec_write_err for cleanup, but because the failure occurred before multi->addr.block_cookie was set, the cleanup can neither free the written block nor roll back the bytes_total increment. The result is:

      1. The page is written to the page service and is never discarded (storage leak).
      2. bytes_total is permanently inflated by the size of the orphaned block (accounting leak).
      3. The system continues running (no panic), so the leak persists and compounds.
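
      The failure sequence above can be modeled in a few lines of C. This is a toy sketch, not WiredTiger code: every toy_* name is hypothetical, the put to the page service is assumed to succeed, and the cleanup is assumed to undo only what the block cookie records, as described for the non-panic path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins; none of these are WiredTiger types. */
typedef struct {
    uint64_t bytes_total; /* models the btree size counter */
} toy_btree;

typedef struct {
    bool cookie_set;      /* models multi->addr.block_cookie being populated */
    uint64_t cookie_size; /* size recorded in the cookie */
} toy_multi;

/* Models the inner write: the put succeeds, then the counter is bumped at once. */
static int
toy_write_internal(toy_btree *bt, uint64_t size)
{
    /* (the put to the page service would happen here and succeed) */
    bt->bytes_total += size;
    return (0);
}

/* Models address packing, which may fail after the block is already written. */
static int
toy_addr_pack(toy_multi *m, uint64_t size, bool fail)
{
    if (fail)
        return (-1);
    m->cookie_set = true;
    m->cookie_size = size;
    return (0);
}

/* Models the non-panic cleanup: it can only undo what the cookie records. */
static void
toy_write_err(toy_btree *bt, toy_multi *m)
{
    if (!m->cookie_set)
        return; /* no cookie: nothing to free and nothing to roll back */
    bt->bytes_total -= m->cookie_size;
    m->cookie_set = false;
}

/* Models one reconciliation attempt that writes a single block. */
static int
toy_reconcile(toy_btree *bt, uint64_t size, bool pack_fails)
{
    toy_multi m = {false, 0};
    int ret;

    assert(bt != NULL);
    if ((ret = toy_write_internal(bt, size)) == 0)
        ret = toy_addr_pack(&m, size, pack_fails);
    if (ret != 0)
        toy_write_err(bt, &m);
    return (ret);
}
```

      With a forced packing failure, toy_reconcile(&bt, 4096, true) on a zeroed toy_btree returns -1 yet leaves bt.bytes_total at 4096, which is the inflated counter described in item 2.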

      The root cause is that the bytes_total increment sits on the wrong side of the panic boundary. Reconciliation has two error regimes:

      • Before __rec_write_wrapup: failures are recoverable (non-panic return at rec_write.c:381).
      • After __rec_write_wrapup: failures panic (rec_write.c:440-442).

      The bytes_total increment currently happens during image building (before wrapup), so a failure leaves accounting in an inconsistent state that the system must live with.

      Proposed Fix

      1. Remove the __wt_btree_increase_size call from __wti_block_disagg_write_internal.
      2. Accumulate written bytes on the reconciliation struct (e.g., r->disagg_bytes_written) during __rec_split_write, after each successful block write.
      3. Apply the accumulated increment in __rec_write after the non-panic error check passes but before __rec_write_wrapup, so that bytes_total is only committed once we are past the point of no return. This also ensures the increment is visible to __bmd_checkpoint_pack_raw, which reads bytes_total during wrapup.
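
      The three steps amount to an accumulate-then-commit pattern, sketched below under simplifying assumptions. The sketch_* names are hypothetical, only disagg_bytes_written comes from the proposal above, and a single fail flag stands in for any failure in the write-and-pack sequence. It covers the accounting side only; it does not model discarding the orphaned block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the btree and the reconciliation struct. */
typedef struct {
    uint64_t bytes_total; /* committed size counter */
} sketch_btree;

typedef struct {
    uint64_t disagg_bytes_written; /* pending bytes, not yet committed */
} sketch_reconcile;

/* One block write: on success, record the bytes on r only (step 2). */
static int
sketch_split_write(sketch_reconcile *r, uint64_t size, bool fail)
{
    if (fail)
        return (-1); /* models a failure anywhere in write plus address pack */
    r->disagg_bytes_written += size;
    return (0);
}

/* Writes n blocks; fail_at is the index of the block that fails (>= n: none). */
static int
sketch_rec_write(sketch_btree *bt, const uint64_t *sizes, size_t n, size_t fail_at)
{
    sketch_reconcile r = {0};
    size_t i;
    int ret = 0;

    assert(bt != NULL);
    for (i = 0; i < n && ret == 0; i++)
        ret = sketch_split_write(&r, sizes[i], i == fail_at);

    /* Recoverable error path: the pending bytes are dropped along with r. */
    if (ret != 0)
        return (ret);

    /* Past the non-panic check: commit the accumulated bytes once (step 3). */
    bt->bytes_total += r.disagg_bytes_written;

    /* Wrapup would run here and would see the updated bytes_total. */
    return (0);
}
```

      Because the counter is touched in exactly one place, a failure on the recoverable path leaves bytes_total unchanged without needing any rollback logic.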

      Risk

      Low. The only failure point between the plh_put and a successful return is __wti_block_disagg_addr_pack, which can only fail under a logic bug (LSN ordering violation) or an unusual packing edge case. Normal operation will not trigger this. However, if it does occur, the leak is silent and permanent.

            Assignee:
            Zunyi Liu
            Reporter:
            Luke Pearson