- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: Checkpoints, Reconciliation
- None
- Storage Engines, Storage Engines - Persistence, Storage Engines - Transactions
- SE Persistence backlog
- None
In the disaggregated block write path, bytes_total is incremented inside __wti_block_disagg_write_internal (block_disagg_write.c:216) immediately after plh_put succeeds. However, the caller, __wti_block_disagg_write, still has work to do after that point: it calls __wti_block_disagg_addr_pack, which can fail (e.g., the WT_ASSERT_ALWAYS at block_disagg_addr.c:105-106 checking cookie->lsn > cookie->base_lsn).
If addr packing fails, the error propagates up through reconciliation and is handled by the non-panic error path in __rec_write (rec_write.c:372-382). This path calls __rec_write_err for cleanup, but because the failure occurred before multi->addr.block_cookie was set, the cleanup can neither free the written block nor roll back the bytes_total increment. The result is:
- The page is written to the page service and is never discarded (storage leak).
- bytes_total is permanently inflated by the size of the orphaned block (accounting leak).
- The system continues running (no panic), so the leak persists and compounds.
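The sequence above can be modeled in isolation. The following is a minimal sketch, not WiredTiger code: every toy_* name and the toy_state struct are hypothetical stand-ins, and the only behavior it assumes is what this ticket describes (the counter is bumped right after the put, and cleanup can only act when a cookie was recorded).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the state involved; not a WiredTiger type. */
typedef struct {
    uint64_t bytes_total;  /* models the bytes_total counter */
    uint64_t stored_bytes; /* models bytes held by the page service */
    bool have_cookie;      /* models multi->addr.block_cookie being set */
} toy_state;

/* Models the internal write: the put succeeds, then the counter is bumped at once. */
static int
toy_write_internal(toy_state *s, uint64_t size)
{
    s->stored_bytes += size; /* the put succeeded */
    s->bytes_total += size;  /* eager increment, before the caller has finished */
    return (0);
}

/* Models the outer write: address packing runs after the internal write. */
static int
toy_write(toy_state *s, uint64_t size, bool pack_fails)
{
    int ret;

    if ((ret = toy_write_internal(s, size)) != 0)
        return (ret);
    if (pack_fails)
        return (-1); /* the cookie is never recorded */
    s->have_cookie = true;
    return (0);
}

/* Models error cleanup: without a cookie there is nothing it can free or roll back. */
static void
toy_cleanup(toy_state *s, uint64_t size)
{
    if (!s->have_cookie)
        return;
    s->stored_bytes -= size;
    s->bytes_total -= size;
    s->have_cookie = false;
}

/* Returns the counter value left behind after one write attempt plus error cleanup. */
static uint64_t
toy_bytes_total_after_attempt(uint64_t size, bool pack_fails)
{
    toy_state s = {0, 0, false};

    if (toy_write(&s, size, pack_fails) != 0)
        toy_cleanup(&s, size);
    return (s.bytes_total);
}

/* Returns the counter value after `attempts` consecutive failed writes. */
static uint64_t
toy_bytes_total_after_failures(uint64_t size, unsigned attempts)
{
    toy_state s = {0, 0, false};
    unsigned i;

    for (i = 0; i < attempts; ++i)
        if (toy_write(&s, size, true) != 0)
            toy_cleanup(&s, size);
    assert(s.bytes_total == s.stored_bytes); /* both leak in lockstep in this model */
    return (s.bytes_total);
}
```

In this model a single failed pack leaves the counter inflated by the block size, and repeated failures accumulate because nothing panics.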
The root cause is that the bytes_total increment sits on the wrong side of the panic boundary. Reconciliation has two error regimes:
- Before __rec_write_wrapup: failures are recoverable (non-panic return at rec_write.c:381).
- After __rec_write_wrapup: failures panic (rec_write.c:440-442).
The bytes_total increment currently happens during image building (before wrapup), so a failure leaves accounting in an inconsistent state that the system must live with.
Proposed Fix
- Remove the __wt_btree_increase_size call from __wti_block_disagg_write_internal.
- Accumulate written bytes on the reconciliation struct (e.g., r->disagg_bytes_written) during __rec_split_write, after each successful block write.
- Apply the accumulated increment in __rec_write after the non-panic error check passes but before __rec_write_wrapup, so that bytes_total is only committed once we are past the point of no return. This also ensures the increment is visible to __bmd_checkpoint_pack_raw, which reads bytes_total during wrapup.
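A rough sketch of the deferred-commit shape proposed above, again as a standalone toy model rather than a patch. The toy_* functions and structs are hypothetical; only the field name disagg_bytes_written is taken from the suggestion in this ticket. The sketch models the counter only and does not attempt to free an orphaned block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; not WiredTiger types. */
typedef struct {
    uint64_t bytes_total; /* models the bytes_total counter */
} toy_btree;

typedef struct {
    uint64_t disagg_bytes_written; /* per-reconciliation accumulator suggested above */
} toy_reconcile;

/* Models one block write: it reports the size written and no longer touches bytes_total. */
static int
toy_block_write(uint64_t size, bool pack_fails, uint64_t *writtenp)
{
    *writtenp = 0;
    if (pack_fails)
        return (-1); /* address packing failed after the put */
    *writtenp = size;
    return (0);
}

/* Models the split-write step: accumulate only after the whole block write succeeded. */
static int
toy_split_write(toy_reconcile *r, uint64_t size, bool pack_fails)
{
    uint64_t written;
    int ret;

    if ((ret = toy_block_write(size, pack_fails, &written)) != 0)
        return (ret);
    r->disagg_bytes_written += written;
    return (0);
}

/*
 * Models the top-level reconciliation write. fail_at is the index of the block whose pack
 * fails; pass n (or larger) for no failure.
 */
static int
toy_rec_write(toy_btree *btree, const uint64_t *sizes, size_t n, size_t fail_at)
{
    toy_reconcile r = {0};
    size_t i;
    int ret = 0;

    for (i = 0; i < n && ret == 0; ++i)
        ret = toy_split_write(&r, sizes[i], i == fail_at);

    /* Recoverable (non-panic) error path: bytes_total has not been touched. */
    if (ret != 0)
        return (ret);

    /* Past the point of no return: commit once, before wrapup would read the counter. */
    btree->bytes_total += r.disagg_bytes_written;
    return (0);
}

/* Returns the counter value after reconciling three blocks with an optional failure. */
static uint64_t
toy_committed_bytes(size_t fail_at)
{
    static const uint64_t sizes[] = {4096, 8192, 4096};
    toy_btree btree = {0};

    (void)toy_rec_write(&btree, sizes, 3, fail_at);
    return (btree.bytes_total);
}
```

With no failure the counter reflects all three blocks; with a failure on any block the counter stays at its prior value, so the recoverable path leaves accounting consistent.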
Risk
Low. The only failure point between the plh_put and a successful return is __wti_block_disagg_addr_pack, which can only fail under a logic bug (LSN ordering violation) or an unusual packing edge case. Normal operation will not trigger this. However, if it does occur, the leak is silent and permanent.