Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Cache and Eviction
Labels: None
Assigned Teams: Storage Engines - Transactions
Total Hours with Assigned Team: 671.129
Sprint: SE Transactions - 2026-06-19, SE Transactions - 2026-07-03
Story Points: 3

Background

HELP-94512 reported a disaggregated-storage standby node (Atlas Infinite, RC1010) that became effectively unavailable: the WiredTiger cache filled to ~95% (94.9% dirty), eviction made zero progress, CPU spiked, the node stopped reporting metrics, and reads timed out. The node never recovered on its own and required a manual restart.

Root cause of the unrecoverable state

On a disagg standby, dirty pages cannot be reconciled locally; they are pinned by the materialization frontier. When the upstream page materializer stalls (in this incident, page-server disk exhaustion froze the materialized LSN), the frontier stops advancing and the dirty pages become un-writable:

- Eviction cannot drain them (reconcile cannot write the page).
- The production cache back-pressure path (cache_max_wait_ms + WT_ROLLBACK on the largest transaction) cannot help either, because rolling back an already-applied oplog transaction does not make a frontier-pinned page writable.

The cache therefore stays pinned indefinitely and the node hangs.
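
The pinning behavior above can be illustrated with a toy model. This is a minimal sketch, not WiredTiger code: the `Page` type, the per-page `lsn` field, and the rule that a dirty page becomes writable once the materialized LSN reaches it are all simplifying assumptions made for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Page:
    dirty: bool
    lsn: int  # hypothetical: LSN of the newest update held on this page


def evictable(page: Page, materialized_lsn: int) -> bool:
    # Assumption: clean pages can always be dropped; a dirty page is only
    # writable once the materialization frontier has reached its LSN.
    return (not page.dirty) or page.lsn <= materialized_lsn


def evict_pass(cache: list[Page], materialized_lsn: int) -> list[Page]:
    # Return the pages still resident after one eviction pass.
    return [p for p in cache if not evictable(p, materialized_lsn)]


def rollback_largest_txn(cache: list[Page]) -> list[Page]:
    # On a standby the oplog transaction is already applied, so a rollback
    # leaves the frontier-pinned pages exactly as they were.
    return cache


cache = [Page(True, 120), Page(True, 130), Page(False, 90)]
frozen_lsn = 100  # materializer stalled here

stuck = evict_pass(cache, frozen_lsn)             # only the clean page leaves
stuck = rollback_largest_txn(stuck)               # no effect on pinned pages
still_stuck = evict_pass(stuck, frozen_lsn)       # same two dirty pages remain
drained = evict_pass(still_stuck, 130)            # frontier advances: cache drains
```

In this model neither repeated eviction passes nor the rollback path reduces the pinned set; only an advancing frontier does.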

Why WiredTiger did not self-recover

WiredTiger does have a "cache stuck for too long, give up" path that returns ETIMEDOUT, which propagates as WT_RET_PANIC -> WT_CONN_PANIC -> mongod fassert -> external supervisor restart. However, that path is compiled only under HAVE_DIAGNOSTIC. In a release build:

- If eviction verbose is enabled, it only dumps txn/cache state and resets the timer.
- By default (verbose off) it returns early without even logging.

So production builds have no "give up and crash" action for a stuck cache. This was an acceptable trade-off for classic (ASC) storage, where dirty pages can always be reconciled to local disk and the rollback/timeout back-pressure is sufficient to drain the cache. Disaggregated storage introduces a new failure mode (un-writable dirty pages on a standby) that the classic back-pressure mechanism does not cover.
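
The build-dependent behavior described in this section can be summarized as a small decision function. This is a toy Python model of the control flow as the ticket describes it, not the WiredTiger source; the function name, the boolean flags, and the log strings are hypothetical.

```python
import errno


def on_cache_stuck(diagnostic_build: bool, eviction_verbose: bool, log: list) -> int:
    """Model the stuck-cache handler: 0 means keep waiting, ETIMEDOUT means give up."""
    if diagnostic_build:
        # Diagnostic builds give up; the caller escalates this to a panic.
        return errno.ETIMEDOUT
    if eviction_verbose:
        # Release build with eviction verbose: dump state, reset the timer, keep waiting.
        log.append("dump txn/cache state; reset stuck timer")
        return 0
    # Release build, verbose off: silent early return.
    return 0


log = []
release_default = on_cache_stuck(False, False, log)   # keeps waiting, nothing logged
release_verbose = on_cache_stuck(False, True, log)    # keeps waiting, one log entry
diagnostic = on_cache_stuck(True, False, log)         # gives up with ETIMEDOUT
```

Only the diagnostic branch ever returns a non-zero code, which is why a release build can wait on a stuck cache indefinitely.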

Requested work

Implement a mechanism for disaggregated standby nodes to detect a prolonged, unrecoverable cache-stuck condition and self-recover, preferably by crashing so the supervisor restarts the node automatically (as suggested by the reporter in HELP-94512).

Design considerations:

- Detection should target the disagg-specific signal (frozen materialized LSN / matLsnAdv stalled while cache is hard-pressured) rather than blindly reusing the classic cache-stuck heuristic, to avoid false positives on a temporarily slow-but-healthy node.
- Decide whether crash/restart is the right action versus step-down or targeted alerting.
- Restarting only helps if the upstream stall has cleared; include crash-loop protection.
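
One possible shape for the detection and crash-loop considerations above is sketched below. It is illustrative only and does not reflect any agreed design: the class and function names, the 0.90 dirty threshold, the 300-second stall limit, and the three-restarts-per-hour budget are all hypothetical values chosen for the example.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional


@dataclass
class StuckDetector:
    dirty_threshold: float = 0.90   # hypothetical "hard pressure" level
    stall_limit_s: float = 300.0    # hypothetical tolerated stall duration
    _last_lsn: Optional[int] = None
    _stalled_since: Optional[float] = None

    def observe(self, now: float, materialized_lsn: int, dirty_fraction: float) -> bool:
        """Return True once the LSN has been frozen under hard pressure for stall_limit_s."""
        advanced = self._last_lsn is None or materialized_lsn > self._last_lsn
        self._last_lsn = materialized_lsn
        if advanced or dirty_fraction < self.dirty_threshold:
            # Frontier moved or pressure eased: the node is slow at worst, not stuck.
            self._stalled_since = None
            return False
        if self._stalled_since is None:
            self._stalled_since = now
        return now - self._stalled_since >= self.stall_limit_s


def may_self_restart(recent_restarts: list[float], now: float,
                     window_s: float = 3600.0, max_restarts: int = 3) -> bool:
    """Crash-loop guard: allow a restart only while under the budget for the trailing window."""
    in_window = [t for t in recent_restarts if now - t <= window_s]
    return len(in_window) < max_restarts


detector = StuckDetector(stall_limit_s=60.0)
samples = [(0, 100, 0.95), (10, 100, 0.95), (69, 100, 0.95), (70, 100, 0.95)]
verdicts = [detector.observe(t, lsn, dirty) for t, lsn, dirty in samples]
```

Requiring both signals at once (no LSN advance and sustained dirty pressure) is what separates this from the classic cache-stuck heuristic. When the restart budget is exhausted, the node would fall back to whichever non-crash action is chosen, such as alerting.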

See HELP-94512 for the full FTDC timeline and the SLS-side root cause.

Assignee: [DO NOT USE] Backlog - Storage Engines Team
Reporter: Shoufu Du
Votes: 0
Watchers: 3

Created: Jun 03 2026 02:50:24 AM UTC
Updated: Jun 19 2026 06:00:19 AM UTC
