Core Server / SERVER-20306

75% excess memory usage under WiredTiger during stress test

    Details

    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL

      Description

      Issue Status as of Sep 30, 2016

      ISSUE SUMMARY
      MongoDB with WiredTiger may experience excessive memory fragmentation. This is mainly caused by the difference between the way dirty and clean data are represented in WiredTiger: dirty data involves smaller allocations (at the size of individual documents and index entries), which are rewritten in the background into page images (typically 16-32KB). In 3.2.10 and above (and 3.3.11 and above), the WiredTiger storage engine only allows 20% of the cache to become dirty. Eviction works in the background to write dirty data and keep the cache from being filled with small allocations.

      The changes in WT-2665 and WT-2764 limit the overhead from tcmalloc caching and fragmentation to 20% of the cache size (from fragmentation) plus 1GB of cached free memory with default settings.
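
      Whether eviction is keeping the dirty fraction bounded can be checked from the WiredTiger cache statistics in db.serverStatus(); a minimal shell sketch (the statistic names below are as reported by 3.2-era builds and should be treated as assumptions):

        mongo --quiet --eval '
            var c = db.serverStatus().wiredTiger.cache;
            var max   = c["maximum bytes configured"];
            var used  = c["bytes currently in the cache"];
            var dirty = c["tracked dirty bytes in the cache"];
            print("cache used:  " + (100 * used  / max).toFixed(1) + "% of configured maximum");
            print("cache dirty: " + (100 * dirty / max).toFixed(1) + "% (3.2.10+ targets <= 20%)");'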

      USER IMPACT
      Memory fragmentation caused MongoDB to use more memory than expected, leading to swapping and/or out-of-memory errors.

      WORKAROUNDS
      Configure a smaller WiredTiger cache than the default.
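
      For example, capping the cache at 4GB instead of the default (a sketch only; choose a size appropriate to the workload):

        # Command-line form:
        mongod --dbpath /data/db --storageEngine wiredTiger --wiredTigerCacheSizeGB 4

        # Equivalent YAML configuration:
        # storage:
        #   wiredTiger:
        #     engineConfig:
        #       cacheSizeGB: 4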

      AFFECTED VERSIONS
      MongoDB 3.0.0 to 3.2.9 with WiredTiger.

      FIX VERSION
      The fix is included in the 3.2.10 production release.

      This ticket is a spin-off from SERVER-17456, relating to the last issue discussed there.

      Under certain workloads the process uses a large amount of memory in excess of what is allocated. This appears to be due to fragmentation, or some related memory allocation inefficiency. The repro consists of the following (a rough sketch of the workload follows the list):

      • mongod running with 10 GB cache (no journal to simplify the situation)
      • create a 10 GB collection of small documents called "ping", filling the cache
      • create a second 10 GB collection, "pong", replacing the first in the cache
      • issue a query to read the first collection "ping" back into the cache, replacing "pong"
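
      The attached repro-32.sh and repro-32-insert.sh are the authoritative scripts; the following is only a rough, hypothetical sketch of the ping/pong workload (paths, document size, and document counts are made up):

        #!/bin/bash
        # Hypothetical sketch of the ping/pong repro -- not the attached repro-32.sh.
        mongod --dbpath /data/repro --storageEngine wiredTiger \
               --wiredTigerCacheSizeGB 10 --nojournal \
               --fork --logpath /data/repro/mongod.log

        # Fill the cache with ~10 GB of small documents in "ping", then again in "pong".
        for coll in ping pong; do
            mongo --quiet --eval "
                var pad = new Array(1001).join('x');             // ~1 KB of padding per document
                var docs = [];
                for (var i = 0; i < 10 * 1000 * 1000; i++) {     // adjust count to reach ~10 GB
                    docs.push({_id: i, pad: pad});
                    if (docs.length == 1000) { db.$coll.insert(docs); docs = []; }
                }"
        done

        # Read "ping" back into the cache with a full collection scan, evicting "pong".
        mongo --quiet --eval 'db.ping.find().itcount()'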

      Memory stats over the course of the run:

      • from A-B "ping" is being created, and from C-D "pong" is being created, replacing "ping" in the cache
      • starting at D "ping" is being read back into the cache, evicting "pong". As "pong" is evicted from the cache, in principle the freed memory should be usable for reading "ping" back in.
      • however from D-E we see heap size and central cache free bytes increasing. It appears that for some reason the memory freed by evicting "pong" cannot be used to hold "ping", so it is being returned to the central free list, and instead new memory is being obtained from the OS to hold "ping".
      • at E, while "ping" is still being read into memory, we see a change in behavior: free memory appears to have been moved from the central free list to the page heap. WT reports number of pages is no longer increasing. I suspect that at this point "ping" has filled the cache and we are successfully recycling memory freed by evicting older "ping" pages to hold newer "ping" pages.
      • but the net is still about 7 GB of memory in use by the process beyond the 9.5 GB allocated and 9.2 GB in the WT cache, or about a 75% excess (see the measurement sketch after this list).
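
      A sketch of how the figures in the last bullet can be pulled from a running mongod (the tcmalloc and WiredTiger statistic names are assumptions based on 3.2-era serverStatus output):

        mongo --quiet --eval '
            var s = db.serverStatus();
            var heap  = s.tcmalloc.generic.heap_size;                // memory obtained from the OS (approx.)
            var alloc = s.tcmalloc.generic.current_allocated_bytes;  // memory actually allocated
            var wtc   = s.wiredTiger.cache["bytes currently in the cache"];
            var cfree = s.tcmalloc.tcmalloc.central_cache_free_bytes;
            var pfree = s.tcmalloc.tcmalloc.pageheap_free_bytes;
            print("heap:                  " + (heap  / 1e9).toFixed(1) + " GB");
            print("allocated:             " + (alloc / 1e9).toFixed(1) + " GB");
            print("WT cache:              " + (wtc   / 1e9).toFixed(1) + " GB");
            print("central cache free:    " + (cfree / 1e9).toFixed(1) + " GB");
            print("pageheap free:         " + (pfree / 1e9).toFixed(1) + " GB");
            print("excess over allocated: " + (100 * (heap - alloc) / alloc).toFixed(0) + "%");'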

      Theories:

      • smaller buffers freed by evicting "pong" are discontiguous and cannot hold larger buffers required for reading in "ping"
      • the buffers freed by evicting "pong" are contiguous, but adjacent buffers are not coalesced by the allocator
      • buffers are eventually coalesced by the allocator, but not in time to be used for reading in "ping"
      Attachments

      1. es
        25 kB
        Mark Callaghan
      2. metrics.2016-06-07T21-19-37Z-00000.gz
        3.81 MB
        Mark Callaghan
      3. repro-32.sh
        1 kB
        Bruce Lucas
      4. repro-32-insert.sh
        1 kB
        Bruce Lucas
      1. AggressiveReclaim.png
        60 kB
      2. frag-ex1.png
        170 kB
      3. max-heap.png
        58 kB
      4. memory-use.png
        108 kB
      5. MongoDBDataCollectionDec10-mongo42-memory.png
        188 kB
      6. NoAggressiveReclaim.png
        91 kB
      7. pingpong.png
        225 kB
      8. pingpong-decommit.png
        157 kB
      9. repro-32-diagnostic.data-325-detail.png
        143 kB
      10. repro-32-diagnostic.data-325-overview.png
        123 kB
      11. repro-32-diagnostic.data-335-detail.png
        140 kB
      12. repro-32-insert-diagnostic.data-326.png
        183 kB
      13. repro-32-insert-diagnostic.data-335.png
        181 kB


          Activity

          mdcallag Mark Callaghan added a comment -

          Will repeat a test this week. Previous tests used jemalloc.


          mdcallag Mark Callaghan added a comment -

          Repeated a test using tcmalloc instead of jemalloc. The WT block cache is 128G. With the old jemalloc install, VSZ/RSS was 236G/208G; with the bundled tcmalloc, VSZ/RSS was 192G/191G. I will upload the output from db.serverStatus() and the metrics file from diagnostic.data.
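
          (For reference, VSZ/RSS figures like these can be read on Linux with something along the following lines; ps column options vary by platform.)

            # VSZ and RSS of the running mongod, in KB
            ps -o vsz=,rss= -p "$(pidof mongod)"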

          mdcallag Mark Callaghan added a comment -

          es is the output of db.serverStatus(); metrics.* is from diagnostic.data.

          Both were captured at the end of the test run using tcmalloc.
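
          (The es file was presumably produced with something along these lines; the exact invocation is a guess.)

            # Dump db.serverStatus() to a file named "es"
            mongo --quiet --eval 'printjson(db.serverStatus())' > es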

          michael.cahill Michael Cahill added a comment -

          Mark Callaghan, I've attached a graph generated from your diagnostic data:

          What this graph shows is that the WiredTiger cache use is varying between 100-125GB but the pattern of allocation and freeing is causing large amounts of freed memory to accumulate in tcmalloc. In particular, the central freelist grows to 73GB, which accounts for heap size being much larger than the WiredTiger cache.

          In our testing, stock jemalloc has similar overall behavior in the face of this pattern of allocation using the standard malloc/free interface. We think we could improve the situation with jemalloc by using multiple arenas: the first step towards that is in SERVER-24268.

          We do not have a solution today – the best workaround we have is to use a smaller cache size. I am working in WT-2665 on some changes to the patterns of allocation in WiredTiger that should bound the excess memory use.

          michael.cahill Michael Cahill added a comment -

          Given the changes in WT-2665 and WT-2764, both included in 3.2.10 (and 3.3.11), this issue has now been resolved.

          Here are the results of running the attached repro-32-insert.sh script against 3.2.6 and 3.2.10-rc2. Each run varied the $gb variable that determines both the WiredTiger cache size and the volume of data inserted. For each run, I report the maximum value seen during the run for db.serverStatus().mem.resident.

          Here is a graph of peak memory use for 3.2.6 and 3.2.10-rc2:

          As you can see, where there used to be 60+% RAM use over the configured cache size, with 3.2.10 the maximum RAM use tracks the cache size to within a few percent for larger cache sizes.
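
          A rough sketch of that measurement loop (the cache sizes are illustrative, and it assumes repro-32-insert.sh reads $gb from the environment, which may not match the attached script exactly):

            # Hypothetical driver around the attached repro-32-insert.sh
            for gb in 1 2 4 8 16 32; do
                gb=$gb ./repro-32-insert.sh &
                repro_pid=$!
                peak=0
                # Sample resident memory (MB) while the run is in progress and keep the maximum.
                while kill -0 "$repro_pid" 2>/dev/null; do
                    cur=$(mongo --quiet --eval 'print(db.serverStatus().mem.resident)' 2>/dev/null)
                    [ -n "$cur" ] && [ "$cur" -gt "$peak" ] 2>/dev/null && peak=$cur
                    sleep 10
                done
                echo "cache ${gb}GB -> peak db.serverStatus().mem.resident = ${peak}MB"
            done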


            People

            • Votes: 21
            • Watchers: 77
