Fix race between layered drop and drain during step-up

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Layered Tables

      Summary

      During follower-to-leader step-up, the drain worker (__layered_copy_ingest_table) can race with a concurrent session->drop(force=true, checkpoint_wait=false) on the same layered table. Either the drop removes the ingest and stable backing files while the drain worker still holds open cursors on them, or the drain worker tries to open its cursors after the drop has already removed the files. In the second case the cursor open fails with ENOENT, which propagates up through disagg_step_up to __wti_disagg_conn_config, which calls __wt_panic and aborts the process.

      Root Cause

      The drain worker dequeues a work item and begins copying the ingest table to stable before any exclusive lock is held on the dhandle. The drop path (__drop_layered) proceeds to remove the backing files without checking whether a drain is actively copying them. Neither path coordinates with the other, leaving a window where the drain worker opens cursors on files that no longer exist.
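
      The failing order can be replayed deterministically outside the server. The sketch below is only a model: a temporary file stands in for the ingest backing file, and model_drop, model_drain_open and model_replay_race are hypothetical names, not WiredTiger functions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for the drop path: removes the file without any coordination. */
int
model_drop(const char *path)
{
    return (unlink(path) == 0 ? 0 : errno);
}

/* Hypothetical stand-in for the drain worker's cursor open on a backing file. */
int
model_drain_open(const char *path)
{
    int fd;

    if ((fd = open(path, O_RDONLY)) < 0)
        return (errno);
    (void)close(fd);
    return (0);
}

/*
 * Replay the bad interleaving: the work item is already dequeued (no lock held), the drop
 * runs to completion, and only then does the drain worker open the file. Returns the error
 * seen by the drain worker, or -1 if the model itself could not be set up.
 */
int
model_replay_race(void)
{
    char path[] = "/tmp/ingest_model_XXXXXX";
    int fd;

    if ((fd = mkstemp(path)) < 0)
        return (-1);
    (void)close(fd);

    if (model_drop(path) != 0)
        return (-1);
    assert(access(path, F_OK) != 0); /* The backing file is gone before the open. */

    return (model_drain_open(path));
}
```

      In this model the value returned by model_replay_race() is the error the drain worker would propagate upward; nothing between the dequeue and the open prevents the drop from running first.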

      Proposed Solutions

      Three approaches are under evaluation:

      Approach 1: Read lock on ingest dhandle during drain. The drain worker acquires a read lock on the ingest dhandle's rwlock before opening cursors, and checks WT_DHANDLE_DEAD immediately after. The drop path already acquires an exclusive write lock via __wt_session_get_dhandle(WT_DHANDLE_EXCLUSIVE), so this serializes the two paths correctly. A secondary fix clamps database_size to WT_DISAGG_CHECKPOINT_SIZE_BUFFER to prevent a diagnostic assertion when a concurrent drop causes an apparent net-negative size delta.
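
      A minimal sketch of the locking order in Approach 1, assuming a plain POSIX rwlock in place of the dhandle's rwlock. The struct, the enum and the model_* functions are illustrative names only; the boolean dead field stands in for WT_DHANDLE_DEAD. The database_size clamp is not modeled here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of a data handle; this is not the WiredTiger handle layout. */
struct model_dhandle {
    pthread_rwlock_t rwlock; /* Drain holds it shared, drop holds it exclusive. */
    bool dead;               /* Stands in for WT_DHANDLE_DEAD. */
    bool files_present;      /* Stands in for the ingest and stable backing files. */
};

enum model_drain_result { MODEL_DRAIN_COPIED, MODEL_DRAIN_SKIPPED, MODEL_DRAIN_ERROR };

int
model_dhandle_init(struct model_dhandle *dh)
{
    dh->dead = false;
    dh->files_present = true;
    return (pthread_rwlock_init(&dh->rwlock, NULL));
}

/* Drain side: take the read lock first, then check the dead flag before touching files. */
enum model_drain_result
model_drain(struct model_dhandle *dh)
{
    enum model_drain_result result;

    if (pthread_rwlock_rdlock(&dh->rwlock) != 0)
        return (MODEL_DRAIN_ERROR);

    if (dh->dead)
        result = MODEL_DRAIN_SKIPPED; /* The drop won the race; nothing to copy. */
    else {
        /* The drop cannot run while the read lock is held, so the files must exist. */
        assert(dh->files_present);
        /* Cursor opens and the ingest-to-stable copy would happen here. */
        result = MODEL_DRAIN_COPIED;
    }

    (void)pthread_rwlock_unlock(&dh->rwlock);
    return (result);
}

/* Drop side: exclusive access, mark the handle dead, then remove the files. */
int
model_drop(struct model_dhandle *dh)
{
    int ret;

    if ((ret = pthread_rwlock_wrlock(&dh->rwlock)) != 0)
        return (ret);
    dh->dead = true;
    dh->files_present = false;
    return (pthread_rwlock_unlock(&dh->rwlock));
}
```

      The point of the ordering is that the dead check happens under the same lock that protects the copy, so a drain that starts after the drop skips the table and a drop that starts during the drain waits for the copy to finish.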

      Approach 2: Translate ENOENT to WT_NOTFOUND and skip. At both cursor opens in __layered_copy_ingest_table, ENOENT is converted to WT_NOTFOUND and returned to the worker, which treats it as a benign "table already gone" signal and skips the copy cleanly. This approach applies the same database_size clamp as Approach 1.
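
      The error translation in Approach 2 could look roughly like the following, using plain file opens in place of cursor opens. MODEL_NOTFOUND is a placeholder for WT_NOTFOUND (its value here is arbitrary), and all model_* functions are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Placeholder for WT_NOTFOUND; the value is arbitrary and chosen not to clash with errno. */
#define MODEL_NOTFOUND (-1)

/* Open one backing file, converting "file is gone" into the benign not-found code. */
int
model_open_table(const char *path, int *fdp)
{
    assert(fdp != NULL);
    if ((*fdp = open(path, O_RDONLY)) >= 0)
        return (0);
    return (errno == ENOENT ? MODEL_NOTFOUND : errno);
}

/* Models the copy function: both opens apply the same translation. */
int
model_copy_ingest_table(const char *ingest_path, const char *stable_path)
{
    int ingest_fd, ret, stable_fd;

    if ((ret = model_open_table(ingest_path, &ingest_fd)) != 0)
        return (ret);
    if ((ret = model_open_table(stable_path, &stable_fd)) != 0) {
        (void)close(ingest_fd);
        return (ret);
    }

    /* The ingest-to-stable copy would happen here. */

    (void)close(stable_fd);
    (void)close(ingest_fd);
    return (0);
}

/* Models the worker: not-found means the table was dropped, so skip it without failing. */
int
model_drain_worker(const char *ingest_path, const char *stable_path, int *skippedp)
{
    int ret;

    ret = model_copy_ingest_table(ingest_path, stable_path);
    *skippedp = (ret == MODEL_NOTFOUND);
    return (*skippedp ? 0 : ret);
}
```

      Any error other than ENOENT still reaches the caller unchanged, so only the "table already gone" case is downgraded to a skip.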

      Approach 3: Queue scan in __drop_layered. Before removing files, the drop path acquires the drain queue lock and inspects the work queue. If the target table's entry is still queued, it is removed (superseding the drain). If the drain worker has already dequeued the entry and is actively copying, EBUSY is returned so the schema lock is released and the caller retries after the drain finishes. No database_size clamp is required, since the drop either races with nothing or waits for the copy to complete.
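
      One way to picture the queue scan in Approach 3 is the sketch below. It assumes a fixed-size array queue guarded by a mutex and a single drain worker; the struct layout, integer table ids and all model_* names are assumptions made for illustration and do not reflect the real drain queue.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_QUEUE_MAX 8
#define MODEL_NO_TABLE (-1)

/* Hypothetical drain queue: pending table ids plus the id currently being copied. */
struct model_drain_queue {
    pthread_mutex_t lock;
    int queued[MODEL_QUEUE_MAX];
    size_t nqueued;
    int active; /* MODEL_NO_TABLE when the worker is idle. */
};

int
model_queue_init(struct model_drain_queue *q)
{
    q->nqueued = 0;
    q->active = MODEL_NO_TABLE;
    return (pthread_mutex_init(&q->lock, NULL));
}

int
model_queue_push(struct model_drain_queue *q, int table_id)
{
    int ret = 0;

    pthread_mutex_lock(&q->lock);
    if (q->nqueued == MODEL_QUEUE_MAX)
        ret = ENOSPC;
    else
        q->queued[q->nqueued++] = table_id;
    pthread_mutex_unlock(&q->lock);
    return (ret);
}

/* Worker side: dequeue the oldest entry and mark it active. Returns false if empty. */
bool
model_queue_begin(struct model_drain_queue *q)
{
    bool found = false;
    size_t i;

    pthread_mutex_lock(&q->lock);
    assert(q->active == MODEL_NO_TABLE); /* This model allows one copy at a time. */
    if (q->nqueued > 0) {
        q->active = q->queued[0];
        for (i = 1; i < q->nqueued; ++i)
            q->queued[i - 1] = q->queued[i];
        --q->nqueued;
        found = true;
    }
    pthread_mutex_unlock(&q->lock);
    return (found);
}

/* Worker side: the copy is finished. */
void
model_queue_finish(struct model_drain_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->active = MODEL_NO_TABLE;
    pthread_mutex_unlock(&q->lock);
}

/* Drop side: EBUSY if the table is being copied, otherwise remove any queued entries. */
int
model_drop_prepare(struct model_drain_queue *q, int table_id)
{
    size_t i, kept = 0;
    int ret = 0;

    pthread_mutex_lock(&q->lock);
    if (q->active == table_id)
        ret = EBUSY;
    else {
        for (i = 0; i < q->nqueued; ++i)
            if (q->queued[i] != table_id)
                q->queued[kept++] = q->queued[i];
        q->nqueued = kept;
    }
    pthread_mutex_unlock(&q->lock);
    return (ret);
}
```

      A caller that sees EBUSY from model_drop_prepare would release its locks and retry, which mirrors the retry described above; a return of 0 means no drain of that table is queued or running, so the files can be removed.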

            Assignee:
            Alexander Pullen
            Reporter:
            Alexander Pullen