SERVER-24580 (Core Server)

Improve performance when WiredTiger cache is full

    • Type: Improvement
    • Resolution: Done
    • Priority: Critical - P2
    • Fix Version/s: 3.2.8, 3.3.10
    • Affects Version/s: None
    • Component/s: WiredTiger
    • Backwards Compatibility: Fully Compatible

      When cache utilization hits 95%, performance falls off a cliff, severely impacting production.

      If the solution to this isn't to (gently) keep utilization from hitting 95%, then do we need to look at why getting application threads involved in evictions at 95% is so impactful? Note that in the incident on the primary that bruce.lucas analyzed, it appeared to me that the shortfall between the evictions required to keep the cache steady and the actual evictions was only 0.5%, yet the impact on operation rates and latencies of getting application threads involved in evictions seemed far out of proportion to the shortfall they had to make up.

      If, on the other hand, evictions are really so fundamentally difficult that increasing the eviction rate by 0.5% is hard, does it make sense to look at it from the other end: throttling application threads by the 0.5% required (in this example) to make up the shortfall, by very slightly reducing the rate of pages read into cache? A similar analysis of the lag incident on the secondary showed that the shortfall was about 9%, yet making up that shortfall when the cache hit 95% utilization nearly brought replication to a halt for extended periods.
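
      The arithmetic behind the two paragraphs above can be sketched roughly as follows. This is only an illustration, not WiredTiger's eviction logic: the function names and the 10,000 pages/s read rate are hypothetical, and it assumes that keeping the cache steady simply means evicting pages as fast as they are read in. Only the 0.5% and 9% shortfalls come from the incidents described here.

```python
import math


def eviction_shortfall(required_rate: float, actual_rate: float) -> float:
    """Fraction of the required eviction rate that is not being met."""
    if required_rate <= 0:
        raise ValueError("required_rate must be positive")
    return max(0.0, (required_rate - actual_rate) / required_rate)


def throttled_read_rate(read_rate: float, shortfall: float) -> float:
    """Read-in rate after slowing application threads by `shortfall`."""
    if not 0.0 <= shortfall <= 1.0:
        raise ValueError("shortfall must be between 0 and 1")
    return read_rate * (1.0 - shortfall)


if __name__ == "__main__":
    # Hypothetical rate: the ticket reports only the percentages.
    read_rate = 10_000.0  # pages/s entering the cache (assumed)
    for label, shortfall in (("primary", 0.005), ("secondary", 0.09)):
        actual_evictions = read_rate * (1.0 - shortfall)
        measured = eviction_shortfall(read_rate, actual_evictions)
        new_rate = throttled_read_rate(read_rate, measured)
        # After throttling, reads match evictions, so the cache stops growing.
        assert math.isclose(new_rate, actual_evictions)
        print(f"{label}: shortfall {measured:.1%}, throttle reads to {new_rate:.0f} pages/s")
```

      Under these assumptions, slowing reads by exactly the shortfall fraction brings the read-in rate down to the eviction rate already being achieved, which is why the cost of closing a 0.5% gap would be expected to be small.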

        1. stalls2-unpatched.png (96 kB)
        2. stalls2-patched.png (79 kB)
        3. stalls1-unpatched.png (90 kB)
        4. stalls1-patched.png (94 kB)
        5. server-24580-patched-recovery.png (113 kB)
        6. secondary-transition.png (247 kB)
        7. s1646-stacks.png (141 kB)
        8. S1646patch.png (339 kB)
        9. s1646-2.png (234 kB)
        10. primary-transition.png (306 kB)
        11. incident-06-18-waiters.png (239 kB)
        12. incident-06-18-server.png (50 kB)
        13. incident-06-18.png (116 kB)
        14. incident-06-12-comparison.png (164 kB)
        15. cs31295.png (212 kB)
        16. 18-second-gap.png (296 kB)

            Assignee: michael.cahill@mongodb.com Michael Cahill (Inactive)
            Reporter: michael.cahill@mongodb.com Michael Cahill (Inactive)
            Votes: 5
            Watchers: 57
