Improvement ticket for disagg standby cache-stuck self-recovery

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Cache and Eviction
    • Storage Engines - Transactions
    • 671.129
    • SE Transactions - 2026-06-19, SE Transactions - 2026-07-03
    • 3

      Background

      HELP-94512 reported a disaggregated-storage standby node (Atlas Infinite, RC1010) that became effectively unavailable: the WiredTiger cache filled to ~95% (94.9% dirty), eviction made zero progress, CPU spiked, the node stopped reporting metrics, and reads timed out. The node never recovered on its own and required a manual restart.

      Root cause of the unrecoverable state

      On a disagg standby, dirty pages cannot be reconciled locally; they are pinned by the materialization frontier. When the upstream page materializer stalls (in this incident, page-server disk exhaustion froze the materialized LSN), the frontier stops advancing and the dirty pages become un-writable:

      • Eviction cannot drain them (reconcile cannot write the page).
      • The production cache back-pressure path (cache_max_wait_ms + WT_ROLLBACK on the largest transaction) cannot help either, because rolling back an already-applied oplog transaction does not make a frontier-pinned page writable.

      The cache therefore stays pinned indefinitely and the node hangs.
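
      The pinning behavior described above can be illustrated with a toy model. This is a sketch for discussion only, not WiredTiger code: the Page type, the per-page lsn field, and the rule "a dirty page is writable once the materialized LSN has reached its LSN" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    page_id: int
    dirty: bool
    lsn: int  # hypothetical: LSN of the newest update on the page


def evictable(page, materialized_lsn):
    """Clean pages can always be dropped. A dirty page is assumed to be
    writable only once the materialization frontier has reached its LSN."""
    return (not page.dirty) or page.lsn <= materialized_lsn


def evict_pass(pages, materialized_lsn):
    """Run one eviction pass; return (pages still cached, number evicted)."""
    kept = [p for p in pages if not evictable(p, materialized_lsn)]
    return kept, len(pages) - len(kept)


if __name__ == "__main__":
    cache = [Page(1, True, 120), Page(2, True, 130), Page(3, False, 90)]
    frozen_lsn = 100  # the frontier has stopped advancing
    cache, evicted = evict_pass(cache, frozen_lsn)
    print("first pass evicted:", evicted)
    cache, evicted = evict_pass(cache, frozen_lsn)
    print("second pass evicted:", evicted, "pinned pages:", len(cache))
```

      In this model, repeated eviction passes against a frozen frontier make no progress on the dirty pages, which matches the "eviction made zero progress" symptom in the incident.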

      Why WiredTiger did not self-recover

      WiredTiger does have a "cache stuck for too long, give up" path that returns ETIMEDOUT, which propagates as WT_RET_PANIC -> WT_CONN_PANIC -> mongod fassert -> external supervisor restart. However, that path is compiled only under HAVE_DIAGNOSTIC. In a release build:

      • If eviction verbose logging is enabled, the path only dumps transaction and cache state and resets the timer.
      • By default (verbose off), the path returns early without logging anything.

      So production builds have no "give up and crash" action for a stuck cache. This was an acceptable trade-off for classic (ASC) storage, where dirty pages can always be reconciled to local disk and the rollback/timeout back-pressure is sufficient to drain the cache. Disaggregated storage introduces a new failure mode (un-writable dirty pages on a standby) that the classic back-pressure mechanism does not cover.
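
      The build-dependent behavior described above can be summarized as a small decision table. The sketch below models only what this ticket states; the function name, the Action enum, and the boolean parameters are hypothetical and do not correspond to actual WiredTiger symbols.

```python
import errno
from enum import Enum


class Action(Enum):
    NONE = "none"                      # return early, no logging
    DUMP_AND_RESET = "dump_and_reset"  # dump txn/cache state, reset the timer
    GIVE_UP = "give_up"                # return ETIMEDOUT, leading to panic


def cache_stuck_action(stuck_too_long, have_diagnostic, evict_verbose):
    """Return (action, error code) for the modeled cache-stuck check."""
    if not stuck_too_long:
        return Action.NONE, 0
    if have_diagnostic:
        # Diagnostic builds only: ETIMEDOUT -> panic -> fassert -> restart.
        return Action.GIVE_UP, errno.ETIMEDOUT
    if evict_verbose:
        return Action.DUMP_AND_RESET, 0
    return Action.NONE, 0


if __name__ == "__main__":
    for diag in (True, False):
        for verbose in (True, False):
            action, err = cache_stuck_action(True, diag, verbose)
            print(f"diagnostic={diag} verbose={verbose} -> {action.value}, err={err}")
```

      No combination with have_diagnostic set to False reaches GIVE_UP, which is the gap this ticket asks to close for disaggregated standbys.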

      Requested work

      Implement a mechanism for disaggregated standby nodes to detect a prolonged, unrecoverable cache-stuck condition and self-recover, preferably by crashing so the supervisor restarts the node automatically (as suggested by the reporter in HELP-94512).

      Design considerations:

      • Detection should target the disagg-specific signal (frozen materialized LSN / matLsnAdv stalled while cache is hard-pressured) rather than blindly reusing the classic cache-stuck heuristic, to avoid false positives on a temporarily slow-but-healthy node.
      • Decide whether crash/restart is the right action versus step-down or targeted alerting.
      • A restart helps only if the upstream stall has cleared; if it has not, the node could hit the same condition again, so include crash-loop protection.
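
      One possible shape for the detection and crash-loop considerations above is sketched below. It is a starting point for design discussion, not a proposed implementation: the class and function names, the thresholds (stall_secs, dirty_pct_trigger, max_restarts, window_secs), and the idea of sampling the materialized LSN together with the dirty-cache percentage are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StuckDetector:
    stall_secs: float = 300.0        # hypothetical: how long the stall must persist
    dirty_pct_trigger: float = 90.0  # hypothetical: "hard-pressured" threshold
    _last_lsn: Optional[int] = None
    _stall_start: Optional[float] = None

    def observe(self, now, materialized_lsn, cache_dirty_pct):
        """Feed one sample; return True once the materialized LSN has been
        frozen under cache pressure for at least stall_secs."""
        advanced = self._last_lsn is None or materialized_lsn > self._last_lsn
        self._last_lsn = materialized_lsn
        pressured = cache_dirty_pct >= self.dirty_pct_trigger
        if advanced or not pressured:
            # Slow-but-healthy nodes reset the timer whenever the LSN moves
            # or the cache pressure eases.
            self._stall_start = None
            return False
        if self._stall_start is None:
            self._stall_start = now
        return now - self._stall_start >= self.stall_secs


def should_crash(recent_restart_times, now, max_restarts=3, window_secs=3600.0):
    """Crash-loop guard: allow a self-inflicted restart only if fewer than
    max_restarts happened within the last window_secs."""
    recent = [t for t in recent_restart_times if now - t <= window_secs]
    return len(recent) < max_restarts


if __name__ == "__main__":
    detector = StuckDetector(stall_secs=60.0)
    for t in (0, 30, 60, 90):
        stuck = detector.observe(t, materialized_lsn=100, cache_dirty_pct=95.0)
        print(f"t={t}s stuck={stuck}")
    print("crash allowed:", should_crash([0.0, 100.0, 200.0], now=300.0))
```

      Keying the timer on LSN advancement rather than on cache fill alone is what separates this from the classic cache-stuck heuristic. When should_crash returns False, the node would need a fallback such as step-down or alerting, which remains an open design question.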

      See HELP-94512 for the full FTDC timeline and the SLS-side root cause.

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Shoufu Du
            Votes:
            0
            Watchers:
            3

              Created:
              Updated: