- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: Cache and Eviction
- None
- Storage Engines - Transactions
- 671.129
- SE Transactions - 2026-06-19, SE Transactions - 2026-07-03
- 3
Background
HELP-94512 reported a disaggregated-storage standby node (Atlas Infinite, RC1010) that became effectively unavailable: the WiredTiger cache filled to ~95% (94.9% dirty), eviction made zero progress, CPU spiked, the node stopped reporting metrics, and reads timed out. The node never recovered on its own and required a manual restart.
Root cause of the unrecoverable state
On a disagg standby, dirty pages cannot be reconciled locally; they are pinned by the materialization frontier. When the upstream page materializer stalls (in this incident, page-server disk exhaustion froze the materialized LSN), the frontier stops advancing and the dirty pages become un-writable:
- Eviction cannot drain them (reconcile cannot write the page).
- The production cache back-pressure path (cache_max_wait_ms + WT_ROLLBACK on the largest transaction) cannot help either, because rolling back an already-applied oplog transaction does not make a frontier-pinned page writable.
The cache therefore stays pinned indefinitely and the node hangs.
Why WiredTiger did not self-recover
WiredTiger does have a "cache stuck for too long, give up" path that returns ETIMEDOUT, which propagates as WT_RET_PANIC -> WT_CONN_PANIC -> mongod fassert -> external supervisor restart. However, that path is compiled only under HAVE_DIAGNOSTIC. In a release build:
- If eviction verbose is enabled, it only dumps txn/cache state and resets the timer.
- By default (verbose off) it returns early without even logging.
So production builds have no "give up and crash" action for a stuck cache. This was an acceptable trade-off for classic (ASC) storage, where dirty pages can always be reconciled to local disk and the rollback/timeout back-pressure is sufficient to drain the cache. Disaggregated storage introduces a new failure mode (un-writable dirty pages on a standby) that the classic back-pressure mechanism does not cover.
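The build-dependent behaviour described above can be summarized as a small decision table. The sketch below is an illustrative Python model only, not WiredTiger source: the function name `cache_stuck_action` and the `StuckAction` enum are hypothetical, and the two booleans stand in for the HAVE_DIAGNOSTIC compile-time flag and the eviction verbose setting.

```python
from enum import Enum


class StuckAction(Enum):
    """What happens once the cache has been stuck past the timeout (illustrative model)."""
    PANIC = "panic"                    # ETIMEDOUT -> WT_RET_PANIC -> fassert -> supervisor restart
    DUMP_AND_RESET = "dump_and_reset"  # dump txn/cache state, reset the timer, keep waiting
    SILENT_RETURN = "silent_return"    # return early without logging


def cache_stuck_action(have_diagnostic: bool, eviction_verbose: bool) -> StuckAction:
    """Model the three outcomes described in this ticket.

    have_diagnostic  -- stands in for a build compiled with HAVE_DIAGNOSTIC
    eviction_verbose -- stands in for eviction verbose logging being enabled
    """
    if have_diagnostic:
        # Only diagnostic builds contain the "give up" path.
        return StuckAction.PANIC
    if eviction_verbose:
        return StuckAction.DUMP_AND_RESET
    return StuckAction.SILENT_RETURN


# A default production build (release, verbose off) takes no recovery action at all.
print(cache_stuck_action(have_diagnostic=False, eviction_verbose=False).value)
```

In this model, neither release-build branch ends in a restart, which is the gap this ticket asks to close for disaggregated standbys.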
Requested work
Implement a mechanism for disaggregated standby nodes to detect a prolonged, unrecoverable cache-stuck condition and self-recover, preferably by crashing so the supervisor restarts the node automatically (as suggested by the reporter in HELP-94512).
Design considerations:
- Detection should target the disagg-specific signal (frozen materialized LSN / matLsnAdv stalled while cache is hard-pressured) rather than blindly reusing the classic cache-stuck heuristic, to avoid false positives on a temporarily slow-but-healthy node.
- Decide whether crash/restart is the right action versus step-down or targeted alerting.
- Restarting helps only if the upstream stall has cleared. If it has not, the restarted node is likely to reach the same stuck state again, so include crash-loop protection.
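One possible shape for the detection and crash-loop considerations above is sketched here in Python, purely as a starting point for discussion. Everything in it is an assumption: the names `Sample`, `StuckDetector`, and `may_crash`, the 90% dirty threshold, the 300-second stall timeout, and the crash budget of 3 per hour are hypothetical and would need tuning against real FTDC data. A real implementation would live inside WiredTiger or mongod rather than in a script.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sample:
    now_s: float            # sample timestamp, in seconds
    materialized_lsn: int   # last materialized LSN seen by the standby
    cache_dirty_pct: float  # dirty bytes as a percentage of cache size


@dataclass
class StuckDetector:
    dirty_threshold_pct: float = 90.0  # hypothetical "hard-pressured" threshold
    stall_timeout_s: float = 300.0     # hypothetical time the stall must persist
    _last_lsn: Optional[int] = field(default=None, init=False)
    _stall_start_s: Optional[float] = field(default=None, init=False)

    def observe(self, s: Sample) -> bool:
        """Return True once the LSN has been frozen under cache pressure for stall_timeout_s."""
        lsn_advanced = self._last_lsn is None or s.materialized_lsn > self._last_lsn
        self._last_lsn = s.materialized_lsn
        pressured = s.cache_dirty_pct >= self.dirty_threshold_pct
        if lsn_advanced or not pressured:
            # Either signal being healthy clears the stall, which avoids
            # flagging a slow-but-progressing node.
            self._stall_start_s = None
            return False
        if self._stall_start_s is None:
            self._stall_start_s = s.now_s
        return s.now_s - self._stall_start_s >= self.stall_timeout_s


def may_crash(recent_crash_times_s: List[float], now_s: float,
              window_s: float = 3600.0, max_crashes: int = 3) -> bool:
    """Crash-loop guard: allow a crash only if fewer than max_crashes occurred within window_s."""
    recent = [t for t in recent_crash_times_s if now_s - t <= window_s]
    return len(recent) < max_crashes


detector = StuckDetector()
for t in (0, 60, 300, 360):
    stuck = detector.observe(Sample(now_s=t, materialized_lsn=100, cache_dirty_pct=94.9))
print(stuck and may_crash([], now_s=360))
```

Requiring both signals at once (frozen LSN and a hard-pressured cache) is what separates this from the classic cache-stuck heuristic. The crash budget would have to be persisted across restarts, for example by the supervisor, which this sketch does not cover.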
See HELP-94512 for the full FTDC timeline and the SLS-side root cause.