Type: Bug
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Layered Tables
Labels:
None

Assigned Teams:

Storage Engines - Foundations
Total Hours with Assigned Team:
214.301
Sprint:
None
Story Points:
None
Linked BFG List:
https://buildbaron.corp.mongodb.com/ui/#/bf/BF-42866

Summary

During follower-to-leader step-up, the drain worker (__layered_copy_ingest_table) can race with a concurrent session->drop(force=true, checkpoint_wait=false) on the same layered table. The drop can remove the ingest and stable backing files while the drain worker holds open cursors on them, or before the drain worker has opened them. In the latter case the cursor open fails with ENOENT, which propagates up through disagg_step_up to __wti_disagg_conn_config, which calls __wt_panic and aborts the process.

Root Cause

The drain worker dequeues a work item and begins copying the ingest table to stable before any exclusive lock is held on the dhandle. The drop path (__drop_layered) proceeds to remove the backing files without checking whether a drain is actively copying them. Neither path coordinates with the other, leaving a window where the drain worker opens cursors on files that no longer exist.

Proposed Solutions

Three approaches are under evaluation:

Approach 1: Read lock on ingest dhandle during drain. The drain worker acquires a read lock on the ingest dhandle's rwlock before opening cursors, and checks WT_DHANDLE_DEAD immediately after. The drop path already acquires an exclusive write lock via __wt_session_get_dhandle(WT_DHANDLE_EXCLUSIVE), so this serializes the two paths correctly. A secondary fix clamps database_size to WT_DISAGG_CHECKPOINT_SIZE_BUFFER to prevent a diagnostic assertion when a concurrent drop causes an apparent net-negative size delta.

Approach 2: Translate ENOENT to WT_NOTFOUND and skip. At both cursor opens in __layered_copy_ingest_table, ENOENT is converted to WT_NOTFOUND and returned to the worker, which treats it as a benign "table already gone" signal and skips the copy cleanly. Same database_size clamp as Approach 1.
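A rough sketch of the error translation in Approach 2, using plain file opens in place of cursor opens. All demo_* names are hypothetical, and DEMO_NOTFOUND is a local stand-in for WT_NOTFOUND rather than the real constant.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define DEMO_NOTFOUND (-1) /* local stand-in for WT_NOTFOUND */

/* Stands in for a cursor open: a missing file becomes DEMO_NOTFOUND, other errors pass through. */
static int
demo_open_backing_file(const char *path)
{
    FILE *fp;

    if ((fp = fopen(path, "r")) == NULL)
        return (errno == ENOENT ? DEMO_NOTFOUND : errno);
    fclose(fp);
    return (0);
}

/* Models the copy routine: both opens use the same translation. */
static int
demo_copy_ingest_table(const char *ingest_path, const char *stable_path)
{
    int ret;

    if ((ret = demo_open_backing_file(ingest_path)) != 0)
        return (ret);
    if ((ret = demo_open_backing_file(stable_path)) != 0)
        return (ret);
    /* The ingest-to-stable copy would run here. */
    return (0);
}

/* Models the worker: "not found" means the table is already gone, so skip cleanly. */
static int
demo_worker_process(const char *ingest_path, const char *stable_path, bool *skippedp)
{
    int ret;

    *skippedp = false;
    ret = demo_copy_ingest_table(ingest_path, stable_path);
    if (ret == DEMO_NOTFOUND) {
        *skippedp = true;
        return (0);
    }
    return (ret);
}
```

Only the "file is gone" case is swallowed in this sketch; any other open failure still reaches the caller as an error.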

Approach 3: Queue scan in __drop_layered. Before removing files, the drop path acquires the drain queue lock and inspects the work queue. If the target table's entry is still queued, it is removed (superseding the drain). If the drain worker has already dequeued the entry and is actively copying, EBUSY is returned so the schema lock is released and the caller retries after the drain finishes. No database_size clamp required since the drop either races with nothing or waits for the copy to complete.
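The queue scan in Approach 3 might look roughly like the following model. The queue layout, the active field, and every demo_* name are assumptions made for illustration; the real drain queue is not shown in this ticket.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define DEMO_QUEUE_MAX 8

/* Hypothetical drain queue: pending table names plus the one being copied. */
struct demo_drain_queue {
    pthread_mutex_t lock;
    const char *queued[DEMO_QUEUE_MAX];
    size_t nqueued;
    const char *active; /* table the worker is copying, or NULL */
};

static void
demo_queue_init(struct demo_drain_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->nqueued = 0;
    q->active = NULL;
}

static int
demo_queue_push(struct demo_drain_queue *q, const char *uri)
{
    int ret = 0;

    pthread_mutex_lock(&q->lock);
    if (q->nqueued == DEMO_QUEUE_MAX)
        ret = ENOSPC;
    else
        q->queued[q->nqueued++] = uri;
    pthread_mutex_unlock(&q->lock);
    return (ret);
}

/* Worker: dequeue the oldest entry and mark it as actively copying. */
static const char *
demo_queue_pop(struct demo_drain_queue *q)
{
    const char *uri = NULL;

    pthread_mutex_lock(&q->lock);
    if (q->nqueued > 0) {
        uri = q->queued[0];
        memmove(&q->queued[0], &q->queued[1], (q->nqueued - 1) * sizeof(q->queued[0]));
        q->nqueued--;
    }
    q->active = uri;
    pthread_mutex_unlock(&q->lock);
    return (uri);
}

/* Worker: the copy finished, nothing is active any more. */
static void
demo_queue_done(struct demo_drain_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->active = NULL;
    pthread_mutex_unlock(&q->lock);
}

/* Drop: EBUSY if the table is being copied, otherwise remove any queued entry. */
static int
demo_drop_check(struct demo_drain_queue *q, const char *uri)
{
    size_t i;
    int ret = 0;

    pthread_mutex_lock(&q->lock);
    if (q->active != NULL && strcmp(q->active, uri) == 0)
        ret = EBUSY; /* caller releases the schema lock and retries */
    else
        for (i = 0; i < q->nqueued; i++)
            if (strcmp(q->queued[i], uri) == 0) {
                memmove(&q->queued[i], &q->queued[i + 1],
                  (q->nqueued - i - 1) * sizeof(q->queued[0]));
                q->nqueued--;
                break;
            }
    pthread_mutex_unlock(&q->lock);
    return (ret);
}
```

Because both the dequeue and the drop-side scan happen under the same queue lock, the drop sees each entry as either still queued, actively copying, or absent, which matches the three outcomes described above.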

Assignee: Alexander Pullen
Reporter: Alexander Pullen
Votes: 0
Watchers: 3

Created: May 08 2026 04:51:45 AM UTC
Updated: May 08 2026 05:30:51 AM UTC
