(HELP-86942 Postmortem): Visa - WT Dirty cache going beyond threshold

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • None
    • Storage Engines - Transactions
    • SE Transactions - 2026-01-30, SE Transactions - 2026-02-13, SE Transactions - 2026-02-27
    • 8

       During the Jan 2, 2026 incident on Visa's on-prem MongoDB 7.0.22 cluster (ref: HELP-86942), the primary node of shard-0 experienced checkpoints that ran for up to approximately 20 minutes. These long-running checkpoints blocked dirty page eviction, so the dirty cache fill ratio rose steadily until it reached the 20% trigger threshold. At that point, application threads were recruited to perform eviction, which stalled writes and triggered a cascading failure (connection storm, file descriptor exhaustion, and node crashes).

       

      The cluster has two very large collections with high insert volume. The key mechanism is that eviction cannot evict pages belonging to a collection currently being checkpointed; when those pages become dirty during the checkpoint, the dirty fill ratio rises and cannot fall until the checkpoint completes. This creates a feedback loop: longer checkpoints lead to more dirty data accumulation, which in turn makes the next checkpoint even longer.
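      As an illustration of how the dirty fill ratio described above can be tracked, here is a minimal Python sketch. It assumes the input is the `wiredTiger.cache` section of a `serverStatus` document and uses the 20% trigger cited in this ticket. The function names and the sample numbers are hypothetical and are not taken from the incident data.

```python
# Sketch: compute the dirty cache fill ratio from a serverStatus
# "wiredTiger.cache" section and compare it with the dirty trigger.
# The 20% value is the trigger threshold cited in this ticket.
DIRTY_TRIGGER_PCT = 20.0


def dirty_fill_pct(cache_stats):
    """Return dirty bytes as a percentage of the configured cache size."""
    dirty = cache_stats["tracked dirty bytes in the cache"]
    max_bytes = cache_stats["maximum bytes configured"]
    if max_bytes <= 0:
        raise ValueError("maximum bytes configured must be positive")
    return 100.0 * dirty / max_bytes


def at_or_above_dirty_trigger(cache_stats, trigger_pct=DIRTY_TRIGGER_PCT):
    """True once the dirty ratio reaches the point where application
    threads are recruited for eviction."""
    return dirty_fill_pct(cache_stats) >= trigger_pct


if __name__ == "__main__":
    # Illustrative numbers only (not from HELP-86942).
    sample = {
        "tracked dirty bytes in the cache": 22 * 1024**3,
        "maximum bytes configured": 100 * 1024**3,
    }
    print(dirty_fill_pct(sample), at_or_above_dirty_trigger(sample))
```

      Sampling this value at a fixed interval across a checkpoint would show whether the ratio only falls after the checkpoint completes, which is the behavior the mechanism above predicts.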

      This investigation focuses on what changed in Visa's workload on Jan 2 that caused each successive checkpoint to take longer than the previous one.

       

      Investigation should cover:
          1. What specific workload characteristics on Jan 2 caused checkpoints to extend to ~20 minutes? Was there a change in write volume, document size, or index pressure compared to normal operations?
          2. Evaluate whether the eviction tuning already applied (eviction_dirty_target=1, eviction_updates_target=1, eviction_updates_trigger=30, eviction threads_min=20, eviction threads_max=20) is sufficient, or whether further tuning or architectural changes are needed to prevent recurrence.
          3. Assess the applicability of existing improvements (e.g., WT-15211 stepwise eviction for 8.0+, WT-15538 slow eviction with high update ratio) to this customer's scenario and whether any can be backported.
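      For reference when evaluating item 2, the sketch below assembles the listed settings into a WiredTiger configuration string of the kind accepted by the `wiredTigerEngineRuntimeConfig` server parameter. The helper name is hypothetical, and the ticket does not state how the settings were actually applied, so treat the method of application as an assumption.

```python
# Sketch: build a WiredTiger runtime configuration string from the
# eviction settings listed in item 2. The helper name is hypothetical.
def build_eviction_config(dirty_target, updates_target, updates_trigger,
                          threads_min, threads_max):
    if threads_min > threads_max:
        raise ValueError("threads_min must not exceed threads_max")
    parts = [
        # Thread counts live inside the nested eviction=(...) group.
        f"eviction=(threads_min={threads_min},threads_max={threads_max})",
        f"eviction_dirty_target={dirty_target}",
        f"eviction_updates_target={updates_target}",
        f"eviction_updates_trigger={updates_trigger}",
    ]
    return ",".join(parts)


if __name__ == "__main__":
    # Values as listed in item 2 of this ticket.
    print(build_eviction_config(1, 1, 30, 20, 20))
```

      The resulting string could then be passed to `setParameter` with `wiredTigerEngineRuntimeConfig`. Runtime changes made this way do not persist across restarts unless they are also set in the startup configuration.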

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Linh Tran