Core Server / SERVER-24580

Improve performance when WiredTiger cache is full

    • Type: Improvement
    • Resolution: Done
    • Priority: Critical - P2
    • Fix Version/s: 3.2.8, 3.3.10
    • Affects Version/s: None
    • Component/s: WiredTiger
    • Backwards Compatibility: Fully Compatible

      When cache utilization hits 95%, performance falls off a cliff, severely impacting production.

      If the solution to this isn't to (gently) keep utilization from hitting 95%, then do we need to look at why getting threads involved in evictions at 95% is so impactful? Note that in the incident on the primary that bruce.lucas analyzed, it appeared to me that the shortfall between the evictions required to keep the cache steady and the actual evictions was only 0.5%, yet the impact on operation rates and latencies of getting application threads involved in evictions seemed far out of proportion to the shortfall that they had to make up.

      If, on the other hand, evictions are really so fundamentally difficult that increasing the eviction rate by 0.5% is hard, does it make sense to look at it from the other end: throttling application threads by the 0.5% required (in this example) to make up the shortfall, by very slightly reducing the rate of pages read into cache? A similar analysis of the lag incident on the secondary showed that the shortfall was about 9%, yet making up that shortfall when the cache hit 95% utilization nearly brought replication to a halt for extended periods.
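
To make the arithmetic in the two paragraphs above concrete, here is a minimal sketch of the shortfall and the proportional throttle being proposed. It is an illustration only, not WiredTiger's eviction code: the function names and the page rates are hypothetical, and it assumes that in steady state the required eviction rate equals the rate of pages read into cache.

```python
def eviction_shortfall(required_rate: float, actual_rate: float) -> float:
    """Fraction by which actual evictions fall short of the rate needed
    to keep cache utilization steady (0.0 if there is no shortfall)."""
    if required_rate <= 0:
        return 0.0
    return max(0.0, (required_rate - actual_rate) / required_rate)


def throttled_read_rate(read_rate: float, shortfall: float) -> float:
    """Page read-in rate after throttling application threads by the
    shortfall fraction, so that inflow matches what eviction can absorb."""
    return read_rate * (1.0 - shortfall)


# Hypothetical rates in pages/second, chosen to match the percentages
# quoted in the description (0.5% on the primary, about 9% on the secondary).
primary = eviction_shortfall(required_rate=1000.0, actual_rate=995.0)
secondary = eviction_shortfall(required_rate=1000.0, actual_rate=910.0)

# Throttling reads by the same fraction brings inflow down to the
# eviction rate actually being achieved.
primary_reads = throttled_read_rate(1000.0, primary)
secondary_reads = throttled_read_rate(1000.0, secondary)
```

Under these assumptions, the throttle on the primary costs application threads only about 0.5% of their read rate, which is the contrast being drawn with the much larger observed drop in operation rates.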

      Attachments:
        1. 18-second-gap.png (296 kB, Bruce Lucas)
        2. cs31295.png (212 kB, Bruce Lucas)
        3. incident-06-12-comparison.png (164 kB, Bruce Lucas)
        4. incident-06-18.png (116 kB, Bruce Lucas)
        5. incident-06-18-server.png (50 kB, Bruce Lucas)
        6. incident-06-18-waiters.png (239 kB, Bruce Lucas)
        7. primary-transition.png (306 kB, Bruce Lucas)
        8. s1646-2.png (234 kB, Bruce Lucas)
        9. S1646patch.png (339 kB, Bruce Lucas)
        10. s1646-stacks.png (141 kB, Bruce Lucas)
        11. secondary-transition.png (247 kB, Bruce Lucas)
        12. server-24580-patched-recovery.png (113 kB, Michael Cahill)
        13. stalls1-patched.png (94 kB, Michael Cahill)
        14. stalls1-unpatched.png (90 kB, Michael Cahill)
        15. stalls2-patched.png (79 kB, Michael Cahill)
        16. stalls2-unpatched.png (96 kB, Michael Cahill)

            Assignee: michael.cahill@mongodb.com Michael Cahill (Inactive)
            Reporter: michael.cahill@mongodb.com Michael Cahill (Inactive)
            Votes: 5
            Watchers: 57
