Type: Task
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Not Applicable
Security Level: Public (Available to anyone on the web)
Labels:
- Disag_Storage
- lc_bulk_04_29_26

Assigned Teams: Storage Engines - Foundations
Total Hours with Assigned Team: 1,991.275
Epic Link: SPM-4271
Sprint: SE Foundations - Q4+ Backlog
Story Points: None

Step-down sweep: __disagg_mark_btrees_readonly_then_step_down() (conn_layered.c:1167) iterates all open disagg btrees and marks them WT_BTREE_READONLY before setting conn->layered_table_manager.leader = false.
The Race Condition:
Thread A (user operation)              Thread B (step-down via reconfigure)
─────────────────────────              ────────────────────────────────────
Checks leader == true
→ Constructs URI without checkpoint
  suffix (leader path)
→ dhandle alloc completes
  (HANDLE_LIST_WRITE_LOCK released)
→ Enters __wt_btree_open()...
  dhandle does not yet have
  WT_DHANDLE_OPEN flag set
                                       __disagg_step_down():
                                         Acquires checkpoint_lock
                                         WT_WITH_HANDLE_LIST_READ_LOCK:
                                           Iterates all dhandles
                                           → Thread A's dhandle is skipped
                                             because WT_DHANDLE_OPEN is not
                                             yet set (conn_layered.c:1183)
                                           → All other open disagg btrees
                                             are marked WT_BTREE_READONLY
                                         conn->leader = false
                                         Releases locks
→ __wt_btree_open() continues:
  - Not a checkpoint dhandle
    → WT_DHANDLE_IS_CHECKPOINT is false
  - No code checks leader == false
    in the btree open path
  - F_SET(dhandle, WT_DHANDLE_OPEN)
Result: A disagg btree is now open in
read-write mode on a follower node.
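The interleaving above can be replayed deterministically in a small stand-alone model. This is only a sketch: the `sk_` types, flag values, and functions are hypothetical stand-ins, not WiredTiger code, and the two "threads" are simulated as sequential steps in the order shown in the diagram.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for WT_DHANDLE_OPEN / WT_BTREE_READONLY; values are arbitrary. */
#define SK_DHANDLE_OPEN 0x1u
#define SK_BTREE_READONLY 0x2u

typedef struct {
    unsigned flags;
} sk_handle;

typedef struct {
    bool leader;
    sk_handle *handles;
    size_t nhandles;
} sk_conn;

/* Models the step-down sweep: only handles already flagged open are made read-only. */
static void
sk_step_down(sk_conn *conn)
{
    for (size_t i = 0; i < conn->nhandles; i++)
        if (conn->handles[i].flags & SK_DHANDLE_OPEN)
            conn->handles[i].flags |= SK_BTREE_READONLY;
    conn->leader = false;
}

/* Models the tail of the open path: sets the open flag without consulting leader state. */
static void
sk_finish_open(sk_handle *h)
{
    assert(h != NULL);
    h->flags |= SK_DHANDLE_OPEN;
}

/* Replays the interleaving; returns true if a writable open handle exists on a follower. */
static bool
sk_replay_race(void)
{
    sk_handle handles[2] = {{SK_DHANDLE_OPEN}, {0}}; /* handles[1] is mid-open in Thread A */
    sk_conn conn = {true, handles, 2};

    bool saw_leader = conn.leader;   /* Thread A: checks leader == true */
    sk_step_down(&conn);             /* Thread B: sweep skips handles[1] */
    if (saw_leader)
        sk_finish_open(&handles[1]); /* Thread A: open completes afterwards */

    return !conn.leader && (handles[1].flags & SK_DHANDLE_OPEN) &&
      !(handles[1].flags & SK_BTREE_READONLY);
}
```

In this model `sk_replay_race()` returns true: the handle that was mid-open during the sweep ends up open and writable after leadership has been dropped, while the handle that was already open is marked read-only.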
Why existing protections are insufficient:
- The step-down sweep at conn_layered.c:1183 filters on F_ISSET(dhandle, WT_DHANDLE_OPEN), so it skips any dhandle that is in the middle of being opened.
- The step-down holds WT_WITH_HANDLE_LIST_READ_LOCK, which blocks new dhandle allocation (which requires the write lock), but does not block __wt_btree_open() / __wt_conn_dhandle_open() since those do not hold the handle list lock.
- The step-down holds checkpoint_lock. The follower's stable btree open path also acquires checkpoint_lock (session_dhandle.c:968-983), but since Thread A entered via the leader path (it checked leader == true before step-down occurred), it may not require checkpoint_lock at all.
- __wt_btree_open() (bt_handle.c) has no code that checks conn->layered_table_manager.leader to set WT_BTREE_READONLY. The readonly flag is only set based on WT_DHANDLE_IS_CHECKPOINT, WT_BTREE_VERIFY, WT_CONN_READONLY, the "readonly" metadata config key, or the disagg checkpoint suffix path — none of which apply here.
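One possible direction, offered only as a sketch and not as the intended fix, is to make the final step of the open path and the step-down sweep mutually exclusive, and to have the open path consult leader state inside that critical section. The `fx_` names and `role_lock` below are hypothetical; the ticket does not say which existing lock, if any, would be suitable in the real code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real flags; values are arbitrary. */
#define FX_DHANDLE_OPEN 0x1u
#define FX_BTREE_READONLY 0x2u

typedef struct {
    unsigned flags;
} fx_handle;

typedef struct {
    pthread_mutex_t role_lock; /* hypothetical: serializes role change and open completion */
    bool leader;
    fx_handle *handles;
    size_t nhandles;
} fx_conn;

/* Step-down: sweep open handles and drop leadership in one critical section. */
static void
fx_step_down(fx_conn *conn)
{
    pthread_mutex_lock(&conn->role_lock);
    for (size_t i = 0; i < conn->nhandles; i++)
        if (conn->handles[i].flags & FX_DHANDLE_OPEN)
            conn->handles[i].flags |= FX_BTREE_READONLY;
    conn->leader = false;
    pthread_mutex_unlock(&conn->role_lock);
}

/* Open completion: re-check leader state under the same lock before publishing the handle. */
static void
fx_finish_open(fx_conn *conn, fx_handle *h)
{
    pthread_mutex_lock(&conn->role_lock);
    if (!conn->leader)
        h->flags |= FX_BTREE_READONLY;
    h->flags |= FX_DHANDLE_OPEN;
    pthread_mutex_unlock(&conn->role_lock);
}

/* Runs both steps in the given order; returns true if the handle ends up open and read-only. */
static bool
fx_readonly_after(bool step_down_first)
{
    fx_handle h = {0};
    fx_conn conn = {.leader = true, .handles = &h, .nhandles = 1};
    int rc = pthread_mutex_init(&conn.role_lock, NULL);
    assert(rc == 0);
    (void)rc;

    if (step_down_first) {
        fx_step_down(&conn);
        fx_finish_open(&conn, &h);
    } else {
        fx_finish_open(&conn, &h);
        fx_step_down(&conn);
    }
    pthread_mutex_destroy(&conn.role_lock);
    return (h.flags & FX_DHANDLE_OPEN) && (h.flags & FX_BTREE_READONLY);
}
```

Because both critical sections take the same lock, only two orderings are possible: the open completes first and the sweep then sees the open flag, or the step-down completes first and the open path sees leader == false. Note that a lock-free re-check of the leader flag after setting the open flag would not be enough on its own with the ordering described in this ticket, since the sweep runs before leader is cleared.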

Related to: WT-16571 Failed: test_layered55.test_layered55.test_obsolete_time_window_palite on ~ RHEL8 zSeries [WiredTiger (develop) @ 0535a11d] (Closed)

Assignee: [DO NOT USE] Backlog - Storage Engines Team
Reporter: Shoufu Du
Votes: 0
Watchers: 1

Created: Mar 10 2026 04:39:51 AM UTC
Updated: May 04 2026 11:35:23 PM UTC
