SERVER-20306: 75% excess memory usage under WiredTiger during stress test

    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL

      Issue Status as of Sep 30, 2016

      ISSUE SUMMARY
      MongoDB with WiredTiger may experience excessive memory fragmentation. This is mainly caused by the difference between the way dirty and clean data is represented in WiredTiger: dirty data is held in small allocations (the size of individual documents and index entries), which are rewritten in the background into page images (typically 16-32KB). In 3.2.10 and above (and 3.3.11 and above), the WiredTiger storage engine only allows 20% of the cache to become dirty. Eviction works in the background to write out dirty data and keep the cache from being filled with small allocations.
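
      The 20% dirty limit can be observed directly; a minimal sketch in the shell, assuming a local mongod (stat names as reported under serverStatus().wiredTiger.cache):

          mongo --quiet --eval '
              var c = db.serverStatus().wiredTiger.cache;
              // on 3.2.10+ this fraction should stay at or below roughly 0.20
              print("dirty fraction of cache: " +
                    (c["tracked dirty bytes in the cache"] /
                     c["maximum bytes configured"]).toFixed(2));'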

      The changes in WT-2665 and WT-2764 limit the overhead from tcmalloc caching and fragmentation to 20% of the cache size (from fragmentation) plus 1GB of cached free memory with default settings. For example, with a 10 GB cache the expected overhead is at most 0.2 × 10 GB + 1 GB = 3 GB.

      USER IMPACT
      Memory fragmentation caused MongoDB to use more memory than expected, leading to swapping and/or out-of-memory errors.

      WORKAROUNDS
      Configure a smaller WiredTiger cache than the default.
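
      For example (the 5 GB figure is illustrative; choose a value that leaves headroom for the overhead described above):

          # at the command line:
          mongod --dbpath /data/db --wiredTigerCacheSizeGB 5

          # or equivalently in the YAML config file:
          # storage:
          #   wiredTiger:
          #     engineConfig:
          #       cacheSizeGB: 5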

      AFFECTED VERSIONS
      MongoDB 3.0.0 to 3.2.9 with WiredTiger.

      FIX VERSION
      The fix is included in the 3.2.10 production release.

      This ticket is a spin-off from SERVER-17456, relating to the last issue discussed there.

      Under certain workloads the process uses a large amount of memory well in excess of what has been allocated. This appears to be due to fragmentation, or some related memory allocation inefficiency. The repro (sketched after the list below) consists of:

      • mongod running with 10 GB cache (no journal to simplify the situation)
      • create a 10 GB collection of small documents called "ping", filling the cache
      • create a second 10 GB collection, "pong", replacing the first in the cache
      • issue a query to read the first collection "ping" back into the cache, replacing "pong"
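
      A minimal sketch of that repro, assuming ~1 KB documents and a dbpath of /data/db (document shape, counts, and paths are illustrative assumptions; the attached repro-32.sh is the actual script):

          # start mongod with a 10 GB cache and journaling disabled
          mongod --dbpath /data/db --nojournal --wiredTigerCacheSizeGB 10 \
                 --fork --logpath /data/db/mongod.log

          mongo --quiet --eval '
              // insert roughly 10 GB of ~1 KB documents into the named collection
              function fill(name) {
                  for (var batch = 0; batch < 1000; batch++) {
                      var docs = [];
                      for (var i = 0; i < 10000; i++)
                          docs.push({ x: new Array(1000).join("x") });
                      db[name].insert(docs);   // ~10 MB per batch
                  }
              }
              fill("ping");                    // fills the cache
              fill("pong");                    // replaces "ping" in the cache
              db.ping.find().itcount();        // reads "ping" back, evicting "pong"
          '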

      Memory stats over the course of the run:

      • from A-B "ping" is being created, and from C-D "pong" is being created, replacing "ping" in the cache
      • starting at D "ping" is being read back into the cache, evicting "pong". As "pong" is evicted from the cache, in principle the freed memory should be usable for reading "ping" into the cache.
      • however from D-E we see heap size and central cache free bytes increasing. It appears that for some reason the memory freed by evicting "pong" cannot be used to hold "ping", so it is being returned to the central free list, and instead new memory is being obtained from the OS to hold "ping".
      • at E, while "ping" is still being read into memory, we see a change in behavior: free memory appears to have been moved from the central free list to the page heap. WT reports number of pages is no longer increasing. I suspect that at this point "ping" has filled the cache and we are successfully recycling memory freed by evicting older "ping" pages to hold newer "ping" pages.
      • but the net is still about 7 GB of memory in use by the process beyond the 9.5 GB allocated and 9.2 GB in the WT cache, or about a 75% excess.
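
      The counters referenced above (heap size, central cache free bytes, page heap) come from the tcmalloc section of serverStatus; a minimal way to sample them (field names as reported by tcmalloc builds of mongod):

          mongo --quiet --eval '
              var t = db.serverStatus().tcmalloc;
              print("heap_size:               " + t.generic.heap_size);
              print("current_allocated_bytes: " + t.generic.current_allocated_bytes);
              print("central_cache_free:      " + t.tcmalloc.central_cache_free_bytes);
              print("pageheap_free:           " + t.tcmalloc.pageheap_free_bytes);'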

      Theories:

      • smaller buffers freed by evicting "pong" are discontiguous and cannot hold larger buffers required for reading in "ping"
      • the buffers freed by evicting "pong" are contiguous, but adjacent buffers are not coalesced by the allocator
      • buffers are eventually coalesced by the allocator, but not in time to be used for reading in "ping"

      Attachments:

        1. AggressiveReclaim.png (60 kB, Alexander Gorrod)
        2. buildInfo.txt (1 kB, rmaheshwari)
        3. buildInfo.txt (1 kB, rmaheshwari)
        4. collStatsLocalOplog.txt (7 kB, rmaheshwari)
        5. collStatsLocalOplog.txt (7 kB, rmaheshwari)
        6. es (25 kB, Mark Callaghan)
        7. frag-ex1.png (170 kB, Bruce Lucas)
        8. getCmdLineOpts.txt (0.9 kB, rmaheshwari)
        9. getCmdLineOpts.txt (0.9 kB, rmaheshwari)
        10. hostInfo.txt (1 kB, rmaheshwari)
        11. max-heap.png (58 kB, Michael Cahill)
        12. memory-use.png (108 kB, Michael Cahill)
        13. metrics.2016-06-07T21-19-37Z-00000.gz (3.81 MB, Mark Callaghan)
        14. MongoDBDataCollectionDec10-mongo42-memory.png (188 kB, Bruce Lucas)
        15. NoAggressiveReclaim.png (91 kB, Alexander Gorrod)
        16. pingpong.png (225 kB, Bruce Lucas)
        17. pingpong-decommit.png (157 kB, Bruce Lucas)
        18. repro-32.sh (1 kB, Bruce Lucas)
        19. repro-32-diagnostic.data-325-detail.png (143 kB, Bruce Lucas)
        20. repro-32-diagnostic.data-325-overview.png (123 kB, Bruce Lucas)
        21. repro-32-diagnostic.data-335-detail.png (140 kB, Bruce Lucas)
        22. repro-32-insert.sh (1 kB, Bruce Lucas)
        23. repro-32-insert-diagnostic.data-326.png (183 kB, Bruce Lucas)
        24. repro-32-insert-diagnostic.data-335.png (181 kB, Bruce Lucas)
        25. rsStatus.txt (2 kB, rmaheshwari)
        26. serverStatus.txt (22 kB, rmaheshwari)

            Assignee: Michael Cahill (Inactive)
            Reporter: Bruce Lucas (Inactive)
            Votes: 21
            Watchers: 78
