Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: WT12.0.0, 9.1.0-rc0
Affects Version/s: None
Component/s: Logging
Labels:
None

Assigned Teams:

Storage Engines - Transactions
Total Hours with Assigned Team:
1,240.397
Sprint:
SE Transactions - 2026-06-19, SE Transactions - 2026-07-03, SE Transactions - 2026-07-17, SE Transactions - 2026-07-31
Story Points:
1

Backport Requested:

v9.0, v8.3
User Summary:
Requested

Symptom

In a disaggregated Atlas cluster (TSBS load), the eviction-server thread panicked with error 34 (ERANGE, "Numerical result out of range") while running the cache-stuck per-session transaction-state dump. The ERANGE propagates to WT_RET_PANIC in the eviction thread run loop, turning a recoverable "cache stuck" diagnostic dump into a fatal WT_PANIC + fassert (mongod crash + restart). Originally surfaced via AF-17533 (MongoDB 9.0.0-rc1010).

Root cause

The per-transaction detail line in __wt_verbose_dump_txn_one() (src/txn/txn.c) is formatted into a buffer sized:

buf_len = (uint32_t)snapshot_buf->size + 512;
if (txn_err_info->err_msg != NULL)
    buf_len += strlen(txn_err_info->err_msg);
WT_ERR(__wt_scr_alloc(session, buf_len, &buf));
WT_ERR(__wt_snprintf((char *)buf->data, buf_len, "transaction id: ...", ...));

WT-16954 made the snapshot list and the last-saved error message dynamic, but left every other field under the fixed 512-byte slack. Measured against the verbatim format string:

Literal field labels (all % specifiers removed): 390 bytes (always present).
Variable fields excluding snapshot + err_msg, worst case: 338 bytes (six timestamps up to 25 bytes each via WT_TS_INT_STRING_SIZE, several uint64 IDs up to 20 digits, a 32-byte LSN string WT_MAX_LSN_STRING, the 21-byte WT_ISO_READ_COMMITTED tag, two error codes).
Total non-snapshot worst case = 390 + 338 = 728 bytes > 512 (over by 216).
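The budget above can be written out as a small self-contained sketch. The constant and function names here are hypothetical and exist only for illustration; the byte counts are the ones measured in this ticket, and the extra byte for the terminating NUL is an assumption about how the measurement was taken.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte counts taken from the measurements in this ticket. */
enum {
    LABEL_BYTES = 390,            /* literal field labels, always present */
    FIELD_WORST_CASE_BYTES = 338, /* variable fields, excluding snapshot + err_msg */
    FIXED_SLACK_BYTES = 512       /* the constant currently added to buf_len */
};

/* Worst-case size of everything except the snapshot list and err_msg. */
static size_t
worst_case_non_snapshot(void)
{
    return ((size_t)LABEL_BYTES + (size_t)FIELD_WORST_CASE_BYTES);
}

/* How far the worst case exceeds the fixed slack. */
static size_t
slack_shortfall(void)
{
    return (worst_case_non_snapshot() - (size_t)FIXED_SLACK_BYTES);
}

/*
 * Hypothetical sizing helper for the "minimal alternative": budget the true
 * worst case instead of 512. The +1 for the terminating NUL is an assumption.
 */
static size_t
detail_buf_len(size_t snapshot_size, const char *err_msg)
{
    size_t len;

    len = snapshot_size + worst_case_non_snapshot() + 1;
    if (err_msg != NULL)
        len += strlen(err_msg);
    return (len);
}
```

With an empty snapshot list and no error message, this helper budgets 729 bytes rather than 512.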

The decisive detail: a reader session with all timestamps (0,0) and tiny IDs already consumes ~481/512, leaving only ~31 bytes of headroom. The active writer session (oldest pinned txn, populated commit/durable/read timestamps, a real checkpoint LSN) trivially exceeds that margin, so __wt_snprintf returns ERANGE.
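The failure mode can be reproduced in miniature. This is a sketch only: checked_snprintf is a hypothetical stand-in that reports ERANGE when the output does not fit, which is the behavior this ticket describes for __wt_snprintf, and the shortened format string and the 72-byte limit are invented so the effect shows up at small scale.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * checked_snprintf -- hypothetical stand-in for a checked snprintf wrapper:
 * returns 0 on success, ERANGE if the output would not fit in size bytes.
 */
static int
checked_snprintf(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);

    if (n < 0)
        return (EINVAL);
    /* vsnprintf returns the length it wanted to write, excluding the NUL. */
    return ((size_t)n >= size ? ERANGE : 0);
}

/*
 * format_txn_line -- format a shortened, illustrative detail line into a
 * buffer limited to "limit" bytes. Labels always cost the same; only the
 * field values vary between an idle and an active session.
 */
static int
format_txn_line(size_t limit, const char *commit_ts, const char *ckpt_lsn)
{
    char buf[256];

    assert(limit <= sizeof(buf));
    return (checked_snprintf(buf, limit,
      "transaction id: %d, commit_timestamp: %s, checkpoint LSN: [%s]", 7, commit_ts, ckpt_lsn));
}
```

Calling format_txn_line(72, "(0, 0)", "0") fits with only a few bytes to spare, while format_txn_line(72, "(1720000000, 42)", "12/3456789") fails with ERANGE even though the format string is identical.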

Why disaggregated triggers it

The format string is unchanged, so the overflow is not caused by any single field growing longer. Disaggregated keeps a checkpoint effectively always running (real ckpt_lsn) and a timestamped long-running transaction, so every field of the active session's line is populated at once and the line deterministically exceeds the ~31-byte margin. Classic clusters rarely have ckpt_lsn and the timestamps populated together, so the margin usually survived.

Suggested fix

Make the detail line overflow-proof rather than tuning the constant. Preferred: build the line with a growable scratch buffer using __wt_buf_catfmt (the same pattern already used for snapshot_buf), eliminating the fixed-size __wt_snprintf entirely. Minimal alternative: size buf_len from the true worst-case label + field budget instead of 512.
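The growable-buffer approach can be sketched outside WiredTiger as follows. The struct growbuf type and the growbuf_catfmt and growbuf_free helpers are hypothetical and use only the C standard library; they mimic the append-with-format idea behind __wt_buf_catfmt but are not its implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer; not WiredTiger's WT_ITEM. */
struct growbuf {
    char *data; /* NUL-terminated once anything has been appended */
    size_t len; /* bytes used, excluding the NUL */
    size_t cap; /* bytes allocated */
};

/* Append formatted text, growing the buffer as needed; 0 on success. */
static int
growbuf_catfmt(struct growbuf *b, const char *fmt, ...)
{
    va_list ap, ap_copy;
    char *p;
    size_t need, newcap;
    int n;

    va_start(ap, fmt);
    va_copy(ap_copy, ap);
    /* First pass: measure the formatted length without writing. */
    n = vsnprintf(NULL, 0, fmt, ap_copy);
    va_end(ap_copy);
    if (n < 0) {
        va_end(ap);
        return (EINVAL);
    }

    need = b->len + (size_t)n + 1;
    if (need > b->cap) {
        newcap = b->cap == 0 ? 64 : b->cap;
        while (newcap < need)
            newcap *= 2;
        if ((p = realloc(b->data, newcap)) == NULL) {
            va_end(ap);
            return (ENOMEM);
        }
        b->data = p;
        b->cap = newcap;
    }

    /* Second pass: the buffer is now large enough, so this cannot truncate. */
    (void)vsnprintf(b->data + b->len, b->cap - b->len, fmt, ap);
    va_end(ap);
    b->len += (size_t)n;
    return (0);
}

/* Release the buffer and reset it to the empty state. */
static void
growbuf_free(struct growbuf *b)
{
    free(b->data);
    b->data = NULL;
    b->len = b->cap = 0;
}
```

Because each append measures before it writes, the only remaining failure is an allocation error; a fully populated line costs more memory instead of producing ERANGE.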

is related to

WT-16954 Eviction thread panic due to buffer overflow in transaction state dump

Closed

related to

WT-18141 task-timed-out: race-condition-stress-asan-test-2 on ubuntu2004-asan [wiredtiger @ f4fbaabd]

Closed

Assignee: Ayesha Ahmed
Reporter: Shoufu Du
Votes: 0
Watchers: 1

Created: Jun 04 2026 12:27:41 AM UTC
Updated: Jul 20 2026 11:48:17 AM UTC
Resolved: Jul 17 2026 02:40:43 PM UTC
