Memory leak and sluggish performance after tens of hours of uptime


    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 3.2.10, 3.2.12
    • Component/s: Performance
    • Operating System: ALL

      After two days of uptime, some of my shards start responding to queries more and more slowly. Even if I stop all the workers loading data into the database and wait for all operations to finish, the mongod instances still respond very slowly once the workers are restarted. Queries that used to take 0.1 seconds now take 40-50 seconds or more. The operations that seem to trigger this behaviour are bulk updates to a collection ($pull a "job") and inserts into a collection in a different database. The WiredTiger cache fills up on those machines, and the mongod instance eats up all available RAM and CPU. Running

      sync && echo 3 > /proc/sys/vm/drop_caches
      

      doesn't help.
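      For reference, a minimal sketch of the kind of workload that triggers this, written with pymongo; the database, collection, and field names here are hypothetical stand-ins, not my actual schema:

      # Hypothetical reproduction of the triggering workload: bulk $pull updates
      # against one collection plus inserts into a collection in another database.
      from pymongo import MongoClient, UpdateMany, InsertOne

      client = MongoClient("mongodb://localhost:27017")  # placeholder mongos address
      queue = client["workdb"]["queue"]                  # collection receiving the bulk $pull updates
      archive = client["otherdb"]["archive"]             # collection in a different database receiving inserts

      # Bulk update: $pull finished "job" entries out of the queue documents.
      queue.bulk_write(
          [UpdateMany({"worker": w}, {"$pull": {"jobs": {"done": True}}}) for w in range(1000)],
          ordered=False)

      # Insert the finished jobs into the other database's collection.
      archive.bulk_write(
          [InsertOne({"worker": w, "status": "done"}) for w in range(1000)],
          ordered=False)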

      I am pretty sure this is a bug, because after I restart all the mongod instances I have no problems whatsoever for another 2-3 days. My storage is very fast, so I don't mind reloading the hot data. How can I investigate this problem? What metrics should I monitor? I have tried writing a dummy stress-test script to run against a 3.2 instance, to see if I can trigger the bug and compare with a 3.4 instance, but I haven't succeeded yet.
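      If it helps, something along these lines is how I would sample the relevant numbers from serverStatus on each shard while the slowdown develops (the shard addresses are placeholders):

      # Sketch: periodically sample WiredTiger cache usage and resident memory
      # from serverStatus on each shard. Host names are placeholders.
      import time
      from pymongo import MongoClient

      shards = {h: MongoClient(h) for h in ["shard1.example:27018", "shard2.example:27018"]}

      while True:
          for host, conn in shards.items():
              status = conn.admin.command("serverStatus")
              cache = status["wiredTiger"]["cache"]
              print(host,
                    "cache_used_bytes:", cache["bytes currently in the cache"],
                    "cache_max_bytes:", cache["maximum bytes configured"],
                    "dirty_bytes:", cache["tracked dirty bytes in the cache"],
                    "resident_mb:", status["mem"]["resident"])
          time.sleep(60)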

      Running too many bulk inserts and updates against the same collection seems to be the culprit. The operations start waiting on one another and the yields pile up. That is when RAM and CPU usage spike and performance degrades until the instance is restarted.
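      Something like this is how the pile-up can be watched with the currentOp command (roughly what db.currentOp() shows in the shell); the namespace below is a placeholder for the hot collection:

      # Sketch: list in-progress operations on the hot collection together with
      # how long they have been running and how often they have yielded.
      from pymongo import MongoClient

      client = MongoClient("mongodb://localhost:27017")  # placeholder address
      for op in client.admin.command("currentOp")["inprog"]:
          if op.get("ns") == "workdb.queue":             # hypothetical hot namespace
              print(op.get("opid"), op.get("op"),
                    "secs_running:", op.get("secs_running"),
                    "numYields:", op.get("numYields"))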

            Assignee:
            Kelsey Schubert
            Reporter:
            Tudor Aursulesei
            Votes:
            0
            Watchers:
            8

              Created:
              Updated:
              Resolved: