  Core Server / SERVER-28392

Memory leak and sluggish performance after tens of hours of uptime


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 3.2.10, 3.2.12
    • Component/s: Performance
    • Labels: None
    • Operating System: ALL

    Description

      After two days of uptime, some of my shards start responding more and more slowly to queries. Even if I stop all the workers that load data into the database and wait for all operations to finish, the mongod instances keep responding very slowly once the workers are restarted. Queries that used to take 0.1 seconds take 40-50 seconds or more. The operations that seem to trigger this behaviour are bulk updates to a collection ($pull of a "job") and inserts into another collection in a different database. The WiredTiger cache fills up on those machines, and the mongod instance eats up all available RAM and CPU. Running

      sync && echo 3 > /proc/sys/vm/drop_caches
      

      doesn't help.

      I am pretty sure this is a bug, because after I restart all the mongod instances I have no problems whatsoever for 2-3 days. I have very fast storage, so I don't mind reloading the hot data. How can I investigate this problem? What metrics should I monitor? I've tried creating a dummy stress test script to run against a 3.2 instance to see if I can trigger the bug and compare with a 3.4 instance, but I haven't succeeded yet.
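
      One possible way to approach the monitoring question is to sample the WiredTiger cache counters from serverStatus at intervals and watch whether the fill and dirty ratios keep climbing over the two days. The sketch below is not part of the original report: the names cache_pressure and poll, the URI, and the interval are illustrative assumptions, and poll needs the third-party pymongo driver.

```python
from typing import Any, Dict


def cache_pressure(server_status: Dict[str, Any]) -> Dict[str, float]:
    """Return fill and dirty ratios of the WiredTiger cache.

    `server_status` is a document shaped like the output of the
    serverStatus command on a WiredTiger-backed mongod.
    """
    cache = server_status["wiredTiger"]["cache"]
    max_bytes = cache["maximum bytes configured"]
    return {
        "fill_ratio": cache["bytes currently in the cache"] / max_bytes,
        "dirty_ratio": cache["tracked dirty bytes in the cache"] / max_bytes,
    }


def poll(uri: str = "mongodb://localhost:27017", interval_s: float = 10.0) -> None:
    """Print the cache ratios forever (hypothetical helper; URI is an assumption)."""
    import time
    from pymongo import MongoClient  # lazy import: cache_pressure works without pymongo

    client = MongoClient(uri)
    while True:
        status = client.admin.command("serverStatus")
        print(time.strftime("%H:%M:%S"), cache_pressure(status))
        time.sleep(interval_s)
```

      Logging these ratios on each shard, alongside RAM and CPU usage, would make it possible to compare a freshly restarted instance with one that has already degraded.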

      Running too many bulk inserts and updates on the same collection seems to be the culprit. The operations start waiting on one another and the yields pile up. This is when RAM and CPU usage spike, and performance degrades until the restart.
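
      To check whether yields really are piling up, one option is to filter the in-progress operation list for long-running operations with a high yield count. This is only a sketch under assumptions: it takes a list of documents shaped like the inprog array of db.currentOp(), and the function name and both thresholds are hypothetical values chosen for illustration.

```python
from typing import Any, Dict, List


def slow_yielding_ops(
    inprog: List[Dict[str, Any]],
    min_secs: int = 5,        # assumed threshold, tune for your workload
    min_yields: int = 100,    # assumed threshold, tune for your workload
) -> List[Dict[str, Any]]:
    """Return long-running operations with many yields, highest yield count first."""
    hits = [
        op
        for op in inprog
        if op.get("secs_running", 0) >= min_secs
        and op.get("numYields", 0) >= min_yields
    ]
    return sorted(hits, key=lambda op: op.get("numYields", 0), reverse=True)
```

      Running this against snapshots taken before and after the slowdown starts would show whether the same namespaces and operation types dominate the list.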

          People

            Assignee: Kelsey Schubert (kelsey.schubert@mongodb.com)
            Reporter: Tudor Aursulesei (thestick613)
            Votes: 0
            Watchers: 8
