Step-down sweep: __disagg_mark_btrees_readonly_then_step_down() (conn_layered.c:1167) iterates all open disagg btrees and marks them WT_BTREE_READONLY before setting conn->layered_table_manager.leader = false.
The Race Condition:
Thread A (user operation)              Thread B (step-down via reconfigure)
─────────────────────────              ────────────────────────────────────
Checks leader == true
→ Constructs URI without checkpoint
  suffix (leader path)
→ dhandle alloc completes
  (HANDLE_LIST_WRITE_LOCK released)
→ Enters __wt_btree_open()...
  dhandle does not yet have
  WT_DHANDLE_OPEN flag set
                                       __disagg_step_down():
                                         Acquires checkpoint_lock
                                         WT_WITH_HANDLE_LIST_READ_LOCK:
                                           Iterates all dhandles
                                           → Thread A's dhandle is skipped
                                             because WT_DHANDLE_OPEN is not
                                             yet set (conn_layered.c:1183)
                                           → All other open disagg btrees
                                             are marked WT_BTREE_READONLY
                                         conn->layered_table_manager.leader = false
                                         Releases locks
→ __wt_btree_open() continues:
  - Not a checkpoint dhandle
    → WT_DHANDLE_IS_CHECKPOINT is false
  - No code checks leader == false
    in the btree open path
  - F_SET(dhandle, WT_DHANDLE_OPEN)

Result: A disagg btree is now open in read-write mode on a follower node.
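The interleaving can be replayed deterministically in a small single-threaded model. This is only a sketch of the logic described above, not WiredTiger code: every `model_*` / `MODEL_*` name is hypothetical, the flag values are made up, and the locks are omitted because, per the timeline, none of them covers the window between Thread A's leader check and its setting of the open flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for WT_DHANDLE_OPEN / WT_BTREE_READONLY; values are illustrative. */
#define MODEL_DHANDLE_OPEN 0x1u
#define MODEL_BTREE_READONLY 0x2u

struct model_handle {
    unsigned flags;
};

struct model_conn {
    bool leader;
    struct model_handle *handles;
    size_t nhandles;
};

/* Thread B: mark every already-open handle read-only, then clear the leader flag. */
static void
model_step_down(struct model_conn *conn)
{
    for (size_t i = 0; i < conn->nhandles; i++) {
        if (!(conn->handles[i].flags & MODEL_DHANDLE_OPEN))
            continue; /* A handle that is still being opened is skipped. */
        conn->handles[i].flags |= MODEL_BTREE_READONLY;
    }
    conn->leader = false;
}

/* Thread A, second half: finish the open without re-checking leadership. */
static void
model_finish_open(struct model_handle *h)
{
    h->flags |= MODEL_DHANDLE_OPEN;
}

/* Replays the timeline; returns true if a writable handle ends up open on a follower. */
static bool
model_replay_race(void)
{
    /* handles[0] is already open; handles[1] is Thread A's handle, allocated but not open. */
    struct model_handle handles[2] = {{MODEL_DHANDLE_OPEN}, {0}};
    struct model_conn conn = {true, handles, 2};

    if (!conn.leader) /* Thread A: checks leader == true and takes the leader path. */
        return false;
    model_step_down(&conn);         /* Thread B runs entirely inside the gap. */
    model_finish_open(&handles[1]); /* Thread A resumes and sets the open flag. */

    bool other_readonly = (handles[0].flags & MODEL_BTREE_READONLY) != 0;
    bool racing_writable = (handles[1].flags & MODEL_DHANDLE_OPEN) != 0 &&
      (handles[1].flags & MODEL_BTREE_READONLY) == 0;
    return !conn.leader && other_readonly && racing_writable;
}
```

In this model `model_replay_race()` returns true: the handle that was already open is marked read-only, while Thread A's handle ends up open and writable after `leader` has been cleared.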
Why existing protections are insufficient:
- The step-down sweep at conn_layered.c:1183 filters on F_ISSET(dhandle, WT_DHANDLE_OPEN), so it skips any dhandle that is in the middle of being opened.
- The step-down sweep runs under WT_WITH_HANDLE_LIST_READ_LOCK, which blocks new dhandle allocation (allocation requires the write lock) but does not block __wt_btree_open() / __wt_conn_dhandle_open(), since neither holds the handle list lock.
- The step-down holds checkpoint_lock. The follower's stable btree open path also acquires checkpoint_lock (session_dhandle.c:968-983), but since Thread A entered via the leader path (it checked leader == true before step-down occurred), it may not require checkpoint_lock at all.
- __wt_btree_open() (bt_handle.c) has no code that checks conn->layered_table_manager.leader to set WT_BTREE_READONLY. The readonly flag is only set based on WT_DHANDLE_IS_CHECKPOINT, WT_BTREE_VERIFY, WT_CONN_READONLY, the "readonly" metadata config key, or the disagg checkpoint suffix path — none of which apply here.
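
The last point can be restated as a predicate. The sketch below is a hypothetical flattening of the conditions listed above into plain booleans. It is not the actual logic in bt_handle.c, and the struct and function names are invented for illustration. It only shows that the leader state is not one of the inputs to the read-only decision.

```c
#include <stdbool.h>

/* Hypothetical view of the inputs named in the text; not real WiredTiger structures. */
struct model_open_inputs {
    bool is_checkpoint_dhandle;    /* WT_DHANDLE_IS_CHECKPOINT */
    bool verify;                   /* WT_BTREE_VERIFY */
    bool conn_readonly;            /* WT_CONN_READONLY */
    bool metadata_readonly;        /* "readonly" metadata config key */
    bool disagg_checkpoint_suffix; /* URI carries the disagg checkpoint suffix */
    bool leader;                   /* conn->layered_table_manager.leader; not consulted */
};

/* Returns true if the modeled open path would mark the btree read-only. */
static bool
model_opens_readonly(const struct model_open_inputs *in)
{
    return in->is_checkpoint_dhandle || in->verify || in->conn_readonly ||
      in->metadata_readonly || in->disagg_checkpoint_suffix;
}
```

For Thread A's handle every listed condition is false, so the model returns false whether `leader` is true or false, which matches the read-write outcome in the timeline.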