Fix the potential data race between open btree or open dhandle and primary step down

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Storage Engines - Foundations

      Step-down sweep: __disagg_mark_btrees_readonly_then_step_down() (conn_layered.c:1167) iterates all open disagg btrees and marks them WT_BTREE_READONLY before setting conn->layered_table_manager.leader = false.
      The Race Condition:
      Thread A (user operation)              Thread B (step-down via reconfigure)
      ─────────────────────────              ────────────────────────────────────
      Checks leader == true
      → Constructs URI without checkpoint
        suffix (leader path)
      → dhandle alloc completes
        (HANDLE_LIST_WRITE_LOCK released)
      → Enters __wt_btree_open()...
        dhandle does not yet have
        WT_DHANDLE_OPEN flag set
                                             __disagg_step_down():
                                               Acquires checkpoint_lock
                                               WT_WITH_HANDLE_LIST_READ_LOCK:
                                                 Iterates all dhandles
                                                 → Thread A's dhandle is skipped
                                                   because WT_DHANDLE_OPEN is not
                                                   yet set (conn_layered.c:1183)
                                                 → All other open disagg btrees
                                                   are marked WT_BTREE_READONLY
                                               conn->layered_table_manager.leader = false
                                               Releases locks
      → __wt_btree_open() continues:
        - Not a checkpoint dhandle
          → WT_DHANDLE_IS_CHECKPOINT is false
        - No code checks leader == false
          in the btree open path
        - F_SET(dhandle, WT_DHANDLE_OPEN)
      Result: A disagg btree is now open in
      read-write mode on a follower node.
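
The interleaving above can be replayed deterministically in a small single-threaded model. This is only an illustrative sketch: every type and function name below (model_dhandle, model_conn, model_step_down, model_finish_open, model_replay_race) is hypothetical and stands in for the real WiredTiger structures and code paths, which are not reproduced here. The booleans model WT_DHANDLE_OPEN, WT_BTREE_READONLY and the leader flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real structures; not the actual WiredTiger definitions. */
struct model_dhandle {
    bool open;     /* models WT_DHANDLE_OPEN */
    bool readonly; /* models WT_BTREE_READONLY on the handle's btree */
};

struct model_conn {
    bool leader; /* models conn->layered_table_manager.leader */
};

/* Models the step-down sweep: only handles already flagged open are marked read-only. */
static void
model_step_down(struct model_conn *conn, struct model_dhandle *handles, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (handles[i].open)
            handles[i].readonly = true;
    conn->leader = false;
}

/* Models the tail of the open path: sets the open flag without consulting the leader flag. */
static void
model_finish_open(struct model_dhandle *dhandle)
{
    dhandle->open = true;
}

/* Replays the interleaving; returns true if a read-write handle ends up open on a follower. */
static bool
model_replay_race(void)
{
    struct model_conn conn = {.leader = true};
    struct model_dhandle handles[2] = {
        {.open = true, .readonly = false},  /* opened well before the step-down */
        {.open = false, .readonly = false}, /* Thread A's handle, still mid-open */
    };

    assert(conn.leader);                /* Thread A: observes leader == true */
    model_step_down(&conn, handles, 2); /* Thread B: sweep, then leader = false */
    model_finish_open(&handles[1]);     /* Thread A: resumes and sets the open flag */

    return !conn.leader && handles[1].open && !handles[1].readonly;
}
```

In this model the sweep marks the first handle read-only but skips the second, so model_replay_race() reports the bad end state described in the result above.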
      Why existing protections are insufficient:
      - The step-down sweep at conn_layered.c:1183 filters on F_ISSET(dhandle, WT_DHANDLE_OPEN), so it skips any dhandle that is in the middle of being opened.
      - The step-down holds WT_WITH_HANDLE_LIST_READ_LOCK, which blocks new dhandle allocation (which requires the write lock), but does not block __wt_btree_open() / __wt_conn_dhandle_open() since those do not hold the handle list lock.
      - The step-down holds checkpoint_lock. The follower's stable btree open path also acquires checkpoint_lock (session_dhandle.c:968-983), but Thread A entered via the leader path (it checked leader == true before the step-down occurred), so it may never acquire checkpoint_lock and is not serialized against the sweep.
      - __wt_btree_open() (bt_handle.c) has no code that checks conn->layered_table_manager.leader to set WT_BTREE_READONLY. The readonly flag is only set based on WT_DHANDLE_IS_CHECKPOINT, WT_BTREE_VERIFY, WT_CONN_READONLY, the "readonly" metadata config key, or the disagg checkpoint suffix path — none of which apply here.
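
One possible direction, shown purely as a sketch and not as the intended or agreed fix: have the open path publish the open flag and consult the leader flag inside the same critical section that the step-down uses for its sweep and flag flip. The lock (role_lock) and all names below are hypothetical and exist only in this model. Whether an existing WiredTiger lock could play this role is not something the ticket establishes.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; these are not the WiredTiger structures. */
struct sketch_dhandle {
    bool open;     /* models WT_DHANDLE_OPEN */
    bool readonly; /* models WT_BTREE_READONLY */
};

struct sketch_conn {
    pthread_mutex_t role_lock; /* hypothetical lock, assumed for this sketch only */
    bool leader;               /* models conn->layered_table_manager.leader */
};

/* Publish the handle as open and consult the leader flag in one critical section. */
static void
sketch_finish_open(struct sketch_conn *conn, struct sketch_dhandle *dhandle)
{
    pthread_mutex_lock(&conn->role_lock);
    dhandle->open = true;
    if (!conn->leader)
        dhandle->readonly = true;
    pthread_mutex_unlock(&conn->role_lock);
}

/* Sweep open handles and clear the leader flag under the same lock. */
static void
sketch_step_down(struct sketch_conn *conn, struct sketch_dhandle *handles, size_t count)
{
    pthread_mutex_lock(&conn->role_lock);
    for (size_t i = 0; i < count; i++)
        if (handles[i].open)
            handles[i].readonly = true;
    conn->leader = false;
    pthread_mutex_unlock(&conn->role_lock);
}

/* Runs both possible orderings; returns true if the handle is read-only in each. */
static bool
sketch_both_orders_readonly(void)
{
    struct sketch_conn conn;
    struct sketch_dhandle first = {false, false}, second = {false, false};
    int rc = pthread_mutex_init(&conn.role_lock, NULL);

    assert(rc == 0);
    (void)rc;
    conn.leader = true;

    /* Order 1: the open finishes first, so the sweep sees the open flag. */
    sketch_finish_open(&conn, &first);
    sketch_step_down(&conn, &first, 1);

    /* Order 2: the step-down finishes first, so the open path sees leader == false. */
    conn.leader = true;
    sketch_step_down(&conn, &second, 1);
    sketch_finish_open(&conn, &second);

    pthread_mutex_destroy(&conn.role_lock);
    return first.readonly && second.readonly;
}
```

Because both critical sections are serialized, only two orderings exist: either the sweep observes the open flag, or the open path observes leader == false. The handle ends up read-only in both. The sketch says nothing about lock ordering against checkpoint_lock or the handle list lock, which a real change would have to consider.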

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Shoufu Du