-
Type: Bug
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: Reconciliation
-
Storage Engines - Transactions
-
79.246
-
SE Transactions - 2026-07-03
-
3
Context
This bug was discovered during testing on the dedicated elegant step-down feature branch (https://github.com/wiredtiger/wiredtiger/compare/develop...wt-17785-enable-elegant-stepdown-mainine). Currently, test/format performs step-down by restarting. On this branch, the restart is replaced with a synchronous, elegant step-down triggered via reconfigure(role=follower). This ticket captures one of the bugs exposed by elegant step-down.
Root Cause
When a leader dirties or splits an internal page before stepping down, that page remains resident in the cache after step-down. The WT-17794 fix (clearing dirty state on outdated disagg read-only pages) only handles leaf pages; internal pages are left in an unresolvable state. When the eviction server or an application-assist thread later selects such an internal page for eviction, it hits the split-generation safety assertion in __wt_evict (evict_page.c:501):
    WT_ASSERT(session,
      closing || !F_ISSET(ref, WT_REF_FLAG_INTERNAL) ||
        F_ISSET(session->dhandle, WT_DHANDLE_DEAD | WT_DHANDLE_EXCLUSIVE) ||
        !__wt_gen_active(session, WT_GEN_SPLIT, page->pg_intl_split_gen));
The assertion fires because after step-down the dhandle is marked read-only and outdated but neither dead nor exclusive, and the split generation of the leader's internal page is still active.
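To make the failing condition concrete, the following is a minimal, self-contained model of the assertion's predicate. It is only an illustrative sketch: the struct `evict_check` and the functions `evict_assert_holds` and `stepdown_internal_page` are hypothetical names invented here, and the booleans merely stand in for the real WiredTiger flag and generation checks.

```c
#include <stdbool.h>

/* Toy stand-ins for the inputs of the assertion; these are NOT WiredTiger types. */
struct evict_check {
    bool closing;           /* eviction is happening as part of a close */
    bool ref_is_internal;   /* models F_ISSET(ref, WT_REF_FLAG_INTERNAL) */
    bool dhandle_dead;      /* models WT_DHANDLE_DEAD on the dhandle */
    bool dhandle_exclusive; /* models WT_DHANDLE_EXCLUSIVE on the dhandle */
    bool split_gen_active;  /* models __wt_gen_active(session, WT_GEN_SPLIT, ...) */
};

/* Returns true when the modeled assertion would pass. */
static bool
evict_assert_holds(const struct evict_check *c)
{
    return c->closing || !c->ref_is_internal || c->dhandle_dead || c->dhandle_exclusive ||
      !c->split_gen_active;
}

/*
 * The state described in this ticket: an internal page left over from the
 * leader, a dhandle that is neither dead nor exclusive, and a split
 * generation that is still active.
 */
static struct evict_check
stepdown_internal_page(void)
{
    struct evict_check c = {
        .closing = false,
        .ref_is_internal = true,
        .dhandle_dead = false,
        .dhandle_exclusive = false,
        .split_gen_active = true,
    };
    return c;
}
```

In this model, `evict_assert_holds` returns false for the state built by `stepdown_internal_page`, and flipping any single input (for example, making the page a leaf or letting the split generation become inactive) makes it return true. This matches the observation that only internal pages trigger the failure.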
Evergreen Task / Link
https://spruce.corp.mongodb.com/version/6a420baae62e3f000732f9f0/tasks (3 occurrences)
https://spruce.corp.mongodb.com/version/6a42393e3def22000738b429/tasks (4 occurrences)
Logs & Stack Trace
Fires from the eviction server thread:
file:T00001.wt_stable, eviction-server: [WT_VERB_DEFAULT][ERROR]: __wt_evict, 501: WiredTiger assertion failed: 'closing || !F_ISSET(ref, WT_REF_FLAG_INTERNAL) || F_ISSET(session->dhandle, WT_DHANDLE_DEAD | WT_DHANDLE_EXCLUSIVE) || !__wt_gen_active(session, WT_GEN_SPLIT, page->pg_intl_split_gen)'
file:T00001.wt_stable, eviction-server: [WT_VERB_DEFAULT][ERROR]: __wt_abort, 32: aborting WiredTiger library
#3 __wt_abort (session=0x30dc3fc8f000) at src/os_common/os_abort.c:32
#4 __wt_evict (session=0x30dc3fc8f000, ref=0x30dc35bfa8c0, previous_state=3, flags=0) at src/evict/evict_page.c:501
#5 __wti_evict_page (session=0x30dc3fc8f000, is_server=false) at src/evict/evict_dispatch.c:254
#6 __wti_evict_lru_pages (session=0x30dc3fc8f000, is_server=false) at src/evict/evict_queue.c:140
#7 __evict_thread_run (session=0x30dc3fc8f000, thread=0x30dc3fe44ff0) at src/evict/evict_thread.c:117
#8 __thread_run (arg=0x30dc3fe44ff0) at src/support/thread_group.c:32
Also fires from an application-assist thread during transaction rollback:
#3 __wt_evict (session=0x71d43fca3800, ref=0x71d43fe0c6e0, previous_state=3, flags=0) at src/evict/evict_page.c:501
#4 __wti_evict_page (session=0x71d43fca3800, is_server=false) at src/evict/evict_dispatch.c:254
#5 __wti_evict_app_assist_worker (session=0x71d43fca3800) at src/evict/evict_dispatch.c:385
#6 __wt_evict_app_assist_worker_check at src/evict/evict_inline.h:990
#7 __wt_txn_rollback at src/txn/txn.c:2220
#8 __session_rollback_transaction at src/session/session_api.c:2084
#9 rollback_transaction at test/format/ops.c:664
#10 ops at test/format/ops.c:1437
Observed consistently across 3 separate patch runs on the elegant step-down branch (3-4 occurrences per run), always on disagg-switch variants. This is the same family as WT-17794: that fix cleared dirty state on outdated disagg read-only leaf pages, and a parallel fix is needed for internal pages.
- related to
-
WT-17105 Disagg Bugs
-
- Open
-