Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: WT12.0.0, 9.0.0-rc0
Affects Version/s: None
Component/s: Metadata
Security Level: Public (Available to anyone on the web)
Labels:
- expedite
- lc_bulk_04_29_26

Assigned Teams: Storage Engines - Foundations
Total Hours with Assigned Team: 963.119
Sprint: None
Story Points: 5
Evergreen Project:
- wiredtiger
Linked BFG List: https://buildbaron.corp.mongodb.com/ui/#/bf/WT-17103
Count of Linked BFGs (Last 30 days): 9

Problem
This issue was introduced after WT-15527, where we began picking up the checkpointed shared metadata, rather than the live shared metadata, during checkpoint pickup.

During the statlog walk (statlog is the component that collects statistics), a shared metadata checkpoint handle may be selected. Before the handle is used, the sweep server may mark it as dead because it has been unused for n seconds. When statlog later tries to use the handle, it attempts to reopen it. Reopening requires acquiring the checkpoint lock, which triggers the following deadlock assertion:

        WT_ASSERT_ALWAYS(session,
          !FLD_ISSET(session->lock_flags, WT_SESSION_LOCKED_SCHEMA) ||
            FLD_ISSET(session->lock_flags, WT_SESSION_LOCKED_CHECKPOINT),
          "deadlock");

and causing an abort.
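
For readers less familiar with the flag macros, here is a minimal standalone sketch of the condition the assertion checks. The `EX_` names and flag values are hypothetical stand-ins, not the real `WT_SESSION_LOCKED_*` definitions, and `ex_lock_order_ok` is an illustrative helper rather than a WiredTiger function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for illustration only. */
#define EX_LOCKED_SCHEMA 0x1u
#define EX_LOCKED_CHECKPOINT 0x2u

/*
 * Mirrors the asserted condition: the state is acceptable when the schema lock is not held,
 * or when the checkpoint lock is held as well.
 */
static bool
ex_lock_order_ok(uint32_t lock_flags)
{
    return ((lock_flags & EX_LOCKED_SCHEMA) == 0 || (lock_flags & EX_LOCKED_CHECKPOINT) != 0);
}

/* Aborts on the failing state, unless NDEBUG is defined (unlike an always-on assertion). */
static void
ex_assert_lock_order(uint32_t lock_flags)
{
    assert(ex_lock_order_ok(lock_flags) && "deadlock");
}
```

Of the four possible combinations of the two flags, only "schema lock held, checkpoint lock not held" fails the check, which is the state described in this ticket.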

Root cause
Although the statlog walk includes a filter to skip dead dhandles (marked by the sweep server), there is a race condition between the filter check and the actual use of the handle.

A handle can pass the filter as “alive,” but then be marked dead by the sweep server before it is used. As a result, __conn_btree_apply_internal encounters a dead handle and attempts to reopen it.
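
The window between the filter check and the use can be replayed deterministically in a small toy model. This is only a sketch of the interleaving: `toy_dhandle` and every `toy_` function are hypothetical, and nothing here is taken from the WiredTiger sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a data handle; only the "dead" marker matters here. */
typedef struct {
    bool dead;
} toy_dhandle;

/* Possible outcomes of one walk step in this toy model. */
typedef enum { TOY_SKIPPED, TOY_USED_OPEN, TOY_NEEDS_REOPEN } toy_result;

/* Time of check: the walk's filter skips handles already marked dead. */
static bool
toy_filter_accepts(const toy_dhandle *h)
{
    assert(h != NULL);
    return (!h->dead);
}

/* What a sweep pass does to an idle handle. */
static void
toy_sweep_mark_dead(toy_dhandle *h)
{
    assert(h != NULL);
    h->dead = true;
}

/* Time of use: a handle found dead at this point has to be reopened. */
static toy_result
toy_use(const toy_dhandle *h)
{
    assert(h != NULL);
    return (h->dead ? TOY_NEEDS_REOPEN : TOY_USED_OPEN);
}

/* Replays one walk step, optionally running the sweep between the check and the use. */
static toy_result
toy_replay(bool start_dead, bool sweep_in_window)
{
    toy_dhandle h = {start_dead};

    if (!toy_filter_accepts(&h))
        return (TOY_SKIPPED);
    if (sweep_in_window)
        toy_sweep_mark_dead(&h);
    return (toy_use(&h));
}
```

In this model, `toy_replay(false, true)` is the problematic case: the handle passes the filter while alive, yet the use step still ends up on the reopen path.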

Reopening the handle requires re-reading it into the cache, which in turn requires both the schema and checkpoint locks. However, only the schema lock is held in this path, leading to the assertion failure.

On a follower, this situation is handled by detecting the state and acquiring the necessary locks. On the leader, this additional locking does not occur, which exposes the issue.
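
The follower and leader difference can be sketched as below. This is a rough model under stated assumptions: the `demo_` names and flag values are hypothetical, and the follower branch is only a guess at what "acquiring the necessary locks" amounts to. Where and in what order the real code takes these locks is not stated in this ticket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values; not the real WT_SESSION_LOCKED_* definitions. */
#define DEMO_LOCKED_SCHEMA 0x1u
#define DEMO_LOCKED_CHECKPOINT 0x2u

/* Hypothetical session stand-in: lock flags plus the node's role. */
typedef struct {
    uint32_t lock_flags;
    bool is_follower;
} demo_session;

/* True when the lock state would satisfy the assertion quoted under Problem. */
static bool
demo_lock_state_ok(uint32_t flags)
{
    return ((flags & DEMO_LOCKED_SCHEMA) == 0 || (flags & DEMO_LOCKED_CHECKPOINT) != 0);
}

/*
 * Models a reopen attempt on a dead handle. Returns true if the reopen would proceed and
 * false if it would trip the assertion. Only the follower branch adds the checkpoint lock.
 */
static bool
demo_reopen_dead_handle(demo_session *s)
{
    bool ok, took_checkpoint = false;

    assert(s != NULL);
    if (s->is_follower && (s->lock_flags & DEMO_LOCKED_CHECKPOINT) == 0) {
        s->lock_flags |= DEMO_LOCKED_CHECKPOINT; /* Stand-in for taking the checkpoint lock. */
        took_checkpoint = true;
    }
    ok = demo_lock_state_ok(s->lock_flags);
    if (took_checkpoint)
        s->lock_flags &= ~DEMO_LOCKED_CHECKPOINT; /* Release what this call acquired. */
    return (ok);
}
```

With only the schema flag set, the follower session passes the check and the leader session does not, which matches the behavior described above.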

is duplicated by
- WT-17112 failed: format-stress-test-disagg-leader-data-validation-1 on ubuntu2004-stress-nonstandalone [wiredtiger @ 4a5646cb] (Closed)

is related to
- WT-15527 We should open the checkpoint of shared metadata in the follower (Closed)
- WT-16837 Investigate whether the stat log server should process ingest tables on leader (Open)

Assignee: Sid Mahajan
Reporter: Sid Mahajan
Votes: 0
Watchers: 3

Created: Apr 07 2026 12:34:56 AM UTC
Updated: May 04 2026 11:36:05 PM UTC
Resolved: Apr 13 2026 04:32:54 AM UTC
