[SERVER-20306] 75% excess memory usage under WiredTiger during stress test Created: 06/Sep/15  Updated: 16/Nov/21  Resolved: 30/Sep/16

Status: Closed
Project: Core Server
Component/s: Performance, WiredTiger
Affects Version/s: 3.0.6, 3.1.7, 3.2.5, 3.3.5
Fix Version/s: 3.2.10, 3.3.11

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 21
Labels: WTplaybook, code-only
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File AggressiveReclaim.png     PNG File MongoDBDataCollectionDec10-mongo42-memory.png     PNG File NoAggressiveReclaim.png     Text File buildInfo.txt     Text File buildInfo.txt     Text File collStatsLocalOplog.txt     Text File collStatsLocalOplog.txt     HTML File es     PNG File frag-ex1.png     Text File getCmdLineOpts.txt     Text File getCmdLineOpts.txt     Text File hostInfo.txt     PNG File max-heap.png     PNG File memory-use.png     File metrics.2016-06-07T21-19-37Z-00000.gz     PNG File pingpong-decommit.png     PNG File pingpong.png     PNG File repro-32-diagnostic.data-325-detail.png     PNG File repro-32-diagnostic.data-325-overview.png     PNG File repro-32-diagnostic.data-335-detail.png     PNG File repro-32-insert-diagnostic.data-326.png     PNG File repro-32-insert-diagnostic.data-335.png     File repro-32-insert.sh     File repro-32.sh     Text File rsStatus.txt     Text File serverStatus.txt    
Issue Links:
Depends
depends on WT-2551 Make WiredTiger aware of memory alloc... Closed
Duplicate
duplicates SERVER-20104 WT high memory usage due to high amou... Closed
is duplicated by SERVER-17456 Mongodb 3.0 wiredTiger storage engine... Closed
is duplicated by SERVER-21837 MongoD memory usage higher than wired... Closed
is duplicated by SERVER-22482 Cache growing to 100% followed by crash Closed
Related
related to SERVER-22906 MongoD uses excessive memory over and... Closed
related to SERVER-23069 Improve tcmalloc freelist statistics Closed
is related to WT-6175 tcmalloc fragmentation is worse in 4.... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   
Issue Status as of Sep 30, 2016

ISSUE SUMMARY
MongoDB with WiredTiger may experience excessive memory fragmentation. This was mainly caused by the difference between the way dirty and clean data is represented in WiredTiger. Dirty data is held in smaller allocations (sized to individual documents and index entries), which are rewritten in the background into page images (typically 16-32KB). In 3.2.10 and above (and 3.3.11 and above), the WiredTiger storage engine only allows 20% of the cache to become dirty. Eviction works in the background to write out dirty data and keep the cache from being filled with small allocations.

The changes in WT-2665 and WT-2764 limit the overhead from tcmalloc caching and fragmentation to 20% of the cache size (from fragmentation) plus 1GB of cached free memory with default settings.
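For reference, the dirty fraction of the cache can be watched from the mongo shell. This is a hedged sketch, assuming the field names that serverStatus().wiredTiger.cache reports on these builds ("maximum bytes configured", "bytes currently in the cache", "tracked dirty bytes in the cache"):

mongo --quiet --eval '
    var c = db.serverStatus().wiredTiger.cache;
    var max   = c["maximum bytes configured"];
    var used  = c["bytes currently in the cache"];
    var dirty = c["tracked dirty bytes in the cache"];
    print("cache used:  " + (100 * used / max).toFixed(1) + "% of configured");
    print("cache dirty: " + (100 * dirty / max).toFixed(1) + "% of configured");
'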

USER IMPACT
Memory fragmentation caused MongoDB to use more memory than expected, leading to swapping and/or out-of-memory errors.

WORKAROUNDS
Configure a smaller WiredTiger cache than the default.
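For example, one way to apply the workaround at startup (the cache size here is illustrative; pick a value below the default for the machine):

mongod --dbpath /data/db --storageEngine wiredTiger --wiredTigerCacheSizeGB 5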

AFFECTED VERSIONS
MongoDB 3.0.0 to 3.2.9 with WiredTiger.

FIX VERSION
The fix is included in the 3.2.10 production release.

This ticket is a spin-off from SERVER-17456, relating to the last issue discussed there.

Under certain workloads the process uses a large amount of memory in excess of what has been allocated. This appears to be due to fragmentation, or some related memory allocation inefficiency. The repro consists of:

  • mongod running with 10 GB cache (no journal to simplify the situation)
  • create a 10 GB collection of small documents called "ping", filling the cache
  • create a second 10 GB collection, "pong", replacing the first in the cache
  • issue a query to read the first collection "ping" back into the cache, replacing "pong"

Memory stats over the course of the run:

  • from A-B "ping" is being created, and from C-D "pong" is being created, replacing "ping" in the cache
  • starting at D "ping" is being read back into the cache, evicting "pong". As "pong" is evicted from cache in principle the memory so freed should be usable for reading "ping" into the cache.
  • however from D-E we see heap size and central cache free bytes increasing. It appears that for some reason the memory freed by evicting "pong" cannot be used to hold "ping", so it is being returned to the central free list, and instead new memory is being obtained from the OS to hold "ping".
  • at E, while "ping" is still being read into memory, we see a change in behavior: free memory appears to have been moved from the central free list to the page heap. WT reports number of pages is no longer increasing. I suspect that at this point "ping" has filled the cache and we are successfully recycling memory freed by evicting older "ping" pages to hold newer "ping" pages.
  • but the net is still about 7 GB of memory in use by the process beyond the 9.5 GB allocated and 9.2 GB in the WT cache, or about a 75% excess.

Theories:

  • smaller buffers freed by evicting "pong" are discontiguous and cannot hold larger buffers required for reading in "ping"
  • the buffers freed by evicting "pong" are contiguous, but adjacent buffers are not coalesced by the allocator
  • buffers are eventually coalesced by the allocator, but not in time to be used for reading in "ping"


 Comments   
Comment by Rakhi Maheshwari [ 09/Oct/18 ]

MongoDB memory usage is very high (75%). For Mongodb01, Mongodb02, and Mongodb03, available memory went down to 48%, 29%, and 31% from 67%, 46%, and 50% respectively over 15 days of load testing. Is there a chance that memory usage reaches 90% and the process gets killed? Why is it not returning free space to the OS?

Attachments: rsStatus.txt, collStatsLocalOplog.txt, getCmdLineOpts.txt, buildInfo.txt, serverStatus.txt, buildInfo.txt, hostInfo.txt

Comment by Michael Cahill (Inactive) [ 30/Sep/16 ]

Given the changes in WT-2665 and WT-2764, both included in 3.2.10 (and 3.3.11), this issue has now been resolved.

Here are the results of running the attached repro-32-insert.sh script against 3.2.6 and 3.2.10-rc2. Each run varied the $gb variable that determines both the WiredTiger cache size and the volume of data inserted. For each run, I report the maximum value seen during the run for db.serverStatus().mem.resident.
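The exact harness is not attached; the following is only a hedged sketch of the kind of loop involved, assuming the $gb variable is edited at the top of the attached script and that the script logs serverStatus samples to ss.log as the original repro script does:

for gb in 4 8 16 32; do
    sed -i "s/^gb=.*/gb=$gb/" repro-32-insert.sh
    bash repro-32-insert.sh
    # peak of mem.resident (MB) seen during the run
    grep -o '"resident":[0-9]*' ss.log | cut -d: -f2 | sort -n | tail -1
done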

Here is a graph of peak memory use for 3.2.6 and 3.2.10-rc2:

As you can see, where there used to be 60+% RAM use over the configured cache size, with 3.2.10 the maximum RAM use tracks the cache size to within a few percent for larger cache sizes.

Comment by Michael Cahill (Inactive) [ 08/Jun/16 ]

mdcallag, I've attached a graph generated from your diagnostic data:

What this graph shows is that the WiredTiger cache use is varying between 100-125GB but the pattern of allocation and freeing is causing large amounts of freed memory to accumulate in tcmalloc. In particular, the central freelist grows to 73GB, which accounts for heap size being much larger than the WiredTiger cache.

In our testing, stock jemalloc has similar overall behavior in the face of this pattern of allocation using the standard malloc/free interface. We think we could improve the situation with jemalloc by using multiple arenas: the first step towards that is in SERVER-24268.

We do not have a solution today – the best workaround we have is to use a smaller cache size. I am working in WT-2665 on some changes to the patterns of allocation in WiredTiger that should bound the excess memory use.

Comment by Mark Callaghan [ 08/Jun/16 ]

The attached file "es" is the output of db.serverStatus(); the metrics.* file is from diagnostic.data.

These were captured at the end of the test, using tcmalloc.

Comment by Mark Callaghan [ 08/Jun/16 ]

Repeated a test using tcmalloc instead of jemalloc. WT block cache is 128G. With old jemalloc install VSZ/RSS was 236G/208G. With bundled tcmalloc VSZ/RSS was 192G/191G. I will upload the output from db.serverStatus() and the metrics file from diagnostic.data

Comment by Mark Callaghan [ 06/Jun/16 ]

Will repeat a test this week. Previous tests used jemalloc


Comment by Martin Bligh [ 06/Jun/16 ]

mdcallag When it's overallocated, can you grab db.serverStatus().tcmalloc?
The main thing is whether the excess is sitting in pageheap_free_bytes or central_cache_free_bytes. If the latter, the breakout of size_classes below that will show us where.

The virtual size being over the configured cache size is fine, but RSS is not.
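A hedged sketch of how to pull those counters, assuming the field layout reported by db.serverStatus({tcmalloc: 1}) on these builds:

mongo --quiet --eval '
    var t = db.serverStatus({tcmalloc: 1}).tcmalloc;
    print("allocated bytes:    " + t.generic.current_allocated_bytes);
    print("heap size:          " + t.generic.heap_size);
    print("pageheap free:      " + t.tcmalloc.pageheap_free_bytes);
    print("central cache free: " + t.tcmalloc.central_cache_free_bytes);
'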

Comment by Mark Callaghan [ 06/Jun/16 ]

I tried both of these settings (listed below) and they reduced the max RSS to not much larger than the WT block cache. That is good. The problems were:
1) the average insert rate during the test dropped by more than 20X
2) the CPU load increased by more than 5X

"eviction_dirty_target=20,eviction_dirty_trigger=25"
"eviction_dirty_target=10,eviction_dirty_trigger=20"

Comment by Mark Callaghan [ 02/Jun/16 ]

I have been doing many tests with MongoDB 3.3.5. With the WT block cache set to 128G I have mongod at ~300G for VSZ and ~242G for RSS. My workload is an insert-only test followed by inserts + queries via the insert benchmark. This isn't a problem for the RocksDB or mmapv1 engines. I use jemalloc instead of tcmalloc. Let me know if you want more details.

I think this problem is made worse by SERVER-16665. For one of my tests I set the block cache to be much larger than the expected database size. But the block cache still gets full as the journal directory gets huge. Once the block cache is full the insert rate drops significantly and the journal size shrinks.

My benchmark client is at https://github.com/mdcallag/mytools/tree/master/bench/ibench which uses an older version of the MongoDB Python client.

I use this command line to invoke the following test script; np.sh is in the GitHub repo linked above:

bash test.sh wt "" /path/to/mongo/binary /path/to/database/directory $storage-device-name 1 8 yes no

e=$1
eo=$2
client=$3
data=$4
dname=$5
checku=$6
dop=$7
mongo=$8
short=$9
 
bash np.sh 500000000 $e "$eo" 3 $client $data  $dop 10 20 0 $dname no $checku 100 0 0 yes $mongo $short
mkdir l
mv o.* l
 
bash np.sh 100000000 $e "$eo" 3 $client $data $dop 10 20 0 $dname no 1 100 2000 1 no $mongo $short
mkdir q2000
mv o.* q2000
 
bash np.sh  50000000 $e "$eo" 3 $client $data $dop 10 20 0 $dname no 1 100 1000 1 no $mongo $short
mkdir q1000
mv o.* q1000
 
bash np.sh   50000000 $e "$eo" 3 $client $data $dop 10 20 0 $dname no 1 100 100 1 no $mongo $short
mkdir q100
mv o.* q100

Comment by Bruce Lucas (Inactive) [ 06/May/16 ]
  • The original repro script on this ticket did not work on 3.2 because changes in WT cache behavior meant that it was not filling the cache.
  • Previously I attached another repro script, repro-32.sh that modified the original script to reproduce the problem on 3.2 and 3.3 by using remove() operations.
  • I'm now attaching a new repro script repro-32-insert.sh that also reproduces the problem by adding some secondary indexes. This new repro script uses only inserts and read queries.

Some metrics showing behavior on 3.2 and 3.3:

version 3.2.6

version 3.3.5

Comment by Martin Bligh [ 28/Apr/16 ]

bruce.lucas Thanks for looking at this. I'll look in more detail tomorrow, but my initial suspicion is that we will still need the WT "gross accounting" fix for the pessimal scenario.

Comment by Bruce Lucas (Inactive) [ 27/Apr/16 ]

martin.bligh, fyi - as per above, I'm not seeing an improvement in 3.3.5 on this test.

Comment by Bruce Lucas (Inactive) [ 27/Apr/16 ]

The original repro script attached to this ticket does not reproduce the problem under 3.2 because changes to WT cache management prevent it from filling the cache. I've attached a modified version as repro-32.sh that does reproduce the problem:

  • create a collection "ping"
  • create a collection "pong"
  • remove all documents from "pong" with db.pong.remove({}); this has the requisite effect of creating a large number of small allocations
  • then read "ping" into the cache, which requires large allocations that can't be satisfied using the small buffers freed in the previous step (a minimal sketch of these steps follows the list)
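A minimal sketch of those steps, assuming the helper functions (make, load) from the original repro script at the bottom of this ticket are in scope; the attached repro-32.sh differs in detail:

make ping                          # fill the cache with "ping"
make pong                          # replace "ping" with "pong" in the cache
mongo --eval 'db.pong.remove({})'  # leaves many small freed allocations behind
load ping                          # large reads that cannot reuse those small buffers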

Here's an overview of a run on 3.2.5:

  • A-B: create ping
  • B-C: create pong
  • C-D: db.pong.remove({})
  • E- : read ping

This produces fragmentation, measured by (heap_size-allocated)/allocated, of about 50%.

Here we zoom in on the point where the fragmentation occurs, comparing a run on 3.2.5 and 3.3.5:

version 3.2.5

version 3.3.5

  • B: finish removing pong
  • C: start reading ping
  • D: heap size reaches maximum

Detailed statistics at D where heap size reaches maximum - this is the most fair point to measure fragmentation because it is all due to heap expansion, which indicates inability to use existing free memory:

=== version 3.2.5
 
    current_allocated_bytes  4569051656
    heap_size                6845276160
    fragmentation            (6845276160-4569051656)/4569051656 = 50%
    central_cache_free_bytes 2115761792
 
=== version 3.3.5
 
    current_allocated_bytes  4187186088
    heap_size                6581121024
    fragmentation            (6581121024-4187186088)/4187186088 = 57%
    central_cache_free_bytes 2098806312
    
    size_classes
        3
            bytes_per_object         32
            free_bytes       1887782240
            allocated_bytes  2417573888

Most of the fragmentation is accounted for by about 2 GB of free memory in the central cache; new statistics collected in 3.3.5 show that most of this is accounted for by about 1.8 GB of free 32-byte buffers, interspersed among about 2.4 GB of live 32-byte buffers.

Comment by Ramon Fernandez Marina [ 08/Mar/16 ]

This issue is still being investigated and there's no fix for it at the moment. It has been scheduled for the current development cycle. Please continue to watch the ticket for updates.

Comment by Tim Hawkins [ 08/Mar/16 ]

Has anything been done to fix the fragmentation issue? If not, it should remain open.

Comment by Johnny Shields [ 08/Mar/16 ]

Can this issue be closed?

Comment by Bruce Lucas (Inactive) [ 04/Mar/16 ]

A different use case that also generates fragmentation is separately tracked by SERVER-22906; any solution to the fragmentation issue on this ticket should also be evaluated against that use case.

Comment by Tim Hawkins [ 27/Jan/16 ]

The upgrade was successful; we saw a considerable improvement in both performance and memory stability, though only time will tell. We are going to start transferring more load to our canary cluster and are planning a production upgrade.

Comment by Alexander Gorrod [ 20/Jan/16 ]

That sounds like a good plan - let us know if you encounter any further issues. We will update this ticket once we have a fix for the excessive memory usage issue.

Comment by Tim Hawkins [ 19/Jan/16 ]

@Alexander, thanks for the prompt response; your answer confirmed our own suspicion that an upgrade to 3.0.8 would help. We will upgrade our "canary" system first: it is identical to our production system and has a few low-revenue websites connected to it, so it will enable us to determine whether there is anything in our application that does not get on well with 3.0.8. We had been observing the same issue on both production and canary.

Comment by Alexander Gorrod [ 19/Jan/16 ]

thawkins Sorry you have run into a problem with memory consumption and MongoDB. The first thing I recommend trying is to upgrade to the latest version of MongoDB - there have been several improvements in terms of memory management since 3.0.6 including WT-2251.

In terms of a workaround the best solution I can give you is to configure a smaller WiredTiger cache size using the configuration option as defined here:
https://docs.mongodb.org/manual/reference/configuration-options/#storage.wiredTiger.engineConfig.cacheSizeGB

Any additional memory on the machine is likely to still be useful: since the WiredTiger cache stores uncompressed pages, additional available memory is generally used by the file system cache to keep compressed data cached.
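For example, in mongod.conf, using the option linked above (the size is illustrative):

storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 4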

Comment by Tim Hawkins [ 19/Jan/16 ]

Is there a workaround for this? We think we may have hit this on a high load production cluster running 3.0.6/WT

We are seeing memory usage exceeding 90% and getting multiple timeouts on queries.

Comment by Alexander Gorrod [ 12/Jan/16 ]

bruce.lucas Allocating sets of structures may help in some situations, but it potentially introduces other problems:

  • The memory is currently allocated by the thread doing an insert or update, so we don't need to co-ordinate the memory allocation in any way, just the step of getting that allocation onto the page. If we pre-allocated a pool of structures we would need to co-ordinate access.
  • We don't know ahead of time how many insert/update structures a page will require. We could play a trick for append-mostly workloads where we pre-allocate some, but that would make the serialized point where we switch out to a new page slower (i.e., all threads wanting to append when we switch out a page would have to wait for the group of allocations).
  • We don't know how large the structures need to be ahead of time - they hold the data being inserted into the table. Your test case inserts tiny (empty?) documents, so all the allocations are small and of uniform size.
Comment by Bruce Lucas (Inactive) [ 12/Jan/16 ]

Would need to check this, but one thing that may be exacerbating the issue: I believe that when inserting into a collection there will be writes to at least two tables - the collection and its index(es) - interspersed in memory, so tcmalloc spans will be tied up by small data structures until the corresponding WT pages for both tables are freed. Would it be possible to allocate WT_INSERT and WT_UPDATE structures in per-page, or at least per-table, batches to avoid this interspersing, so that a tcmalloc span tends to hold only WT_INSERT and WT_UPDATE structures for a single WT page? That might allow tcmalloc spans to be freed for reuse (for the required larger memory allocations) sooner, once the WT_INSERT and/or WT_UPDATE structures for a given WT page that occupy a particular span are freed.

Comment by Alexander Gorrod [ 11/Jan/16 ]

bruce.lucas Thanks for the insightful information. It would be interesting to get your diagnostic code into the WiredTiger tree somehow.

Your analysis looks correct to me. I can give some context on the different structure types you reference:

Structure type | Description | Life cycle summary
WT_INSERT | Allocated when a new item is inserted into a collection or index | Freed when the page the insert belongs to is reconciled (evicted)
WT_UPDATE | Allocated when an existing item is updated | Freed when the page is reconciled, or when a newer update is present and this update is no longer required
WT_REF | Allocated when a new page is created | Freed only when an internal page is evicted from cache

In terms of helping to reduce fragmentation when a workload switches from insert/update to read, I think we can improve the current situation, though some fragmentation will be inevitable. The following table lists changes I think could help:

Change | Potential benefits | Potential penalties
Have eviction more aggressively trickle out dirty pages | When inserts stop we will continue to free WT_INSERT and WT_UPDATE structures; the closer to zero they get, the fewer allocations remain, so fewer spans should stay pinned | Potentially increases write amplification (the number of times each page is written); potentially degrades the on-disk fill factor of pages
Allocate WT_REF structures from a separate allocation pool | WT_REF structures are generally long lived, so allocating them from a different pool avoids sparsely populated pools where WT_REF allocations are interleaved with other allocations | Additional code complexity and different behavior with different allocators
Comment by Bruce Lucas (Inactive) [ 11/Jan/16 ]

I instrumented the code to record a string tag with each allocated block identifying its origin (specifically, filename and line number), and wrote some code that scans the heap to collect statistics about allocated blocks. Running the repro described above, where we create a tree in memory and then replace it with a tree read from disk, the selected stats below show the worst offenders at peak fragmentation, just before the old tree is completely evicted:

MALLOC: +   2951831800 ( 2815.1 MiB) Bytes in central cache freelist
 
============= Total size of freelists
class   1 [        8 bytes ] :     1964 objs;   0.0 MiB;   0.0 cum MiB
class   2 [       16 bytes ] :      789 objs;   0.0 MiB;   0.0 cum MiB
class   3 [       32 bytes ] :    41356 objs;   1.3 MiB;   1.3 cum MiB
class   4 [       48 bytes ] : 24726216 objs; 1131.9 MiB; 1133.2 cum MiB
class   5 [       64 bytes ] : 25030513 objs; 1527.7 MiB; 2660.9 cum MiB
class   6 [       80 bytes ] :  1133994 objs;  86.5 MiB; 2747.4 cum MiB
class   7 [       96 bytes ] :    87440 objs;   8.0 MiB; 2755.4 cum MiB
...
 
============= tagged allocation info
 
class 4 (48 bytes); 39131 spans, 641122304 bytes, 611.422 MiB
  tag src/third_party/wiredtiger/src/btree/row_modify.c:274:  alloc: 1255532 objs, 60265536 bytes, 57.474 MiB; 38391 spans, 628998144 span bytes, 599.859 MiB
  tag src/third_party/wiredtiger/src/btree/row_key.c:479:  alloc: 30910 objs, 1483680 bytes, 1.415 MiB; 7922 spans, 129794048 span bytes, 123.781 MiB
 
class 5 (64 bytes); 51450 spans, 842956800 bytes, 803.906 MiB
  tag src/third_party/wiredtiger/src/btree/row_modify.c:246:  alloc: 1173382 objs, 75096448 bytes, 71.618 MiB; 50009 spans, 819347456 span bytes, 781.391 MiB
  tag src/third_party/wiredtiger/src/btree/bt_split.c:765:  alloc: 87277 objs, 5585728 bytes, 5.327 MiB; 19882 spans, 325746688 span bytes, 310.656 MiB

In other words, there are 2815 MB in the central free list, mostly in tiny objects in class 4 (48 bytes) and class 5 (64 bytes). This is because, as we read the new tree in from disk and evict the old tree currently in memory, we need to allocate 32 kB buffers but essentially cannot reuse the memory that held the tiny buffers until they are all freed, due to fragmentation of that memory by the remaining small buffers.

The tagged allocation info gives us file name and line number, and tells us for each allocation site how many bytes of that allocation are currently active, and also tells us how sparsely allocated they are, that is, how many bytes of spans those allocated blocks are spread out among. Converting filename and line number to name of data structure and reformatting the last four lines above a bit:

71 MiB  of  WT_INSERT  are spread out among  781 MiB of spans
57 MiB      WT_UPDATE                        599 MiB
 5 MiB      WT_REF                           310 MiB 
 1 MiB      WT_IKEY                          123 MiB 

In other words, we have freed almost all of the WT_INSERT, WT_UPDATE, WT_REF, and WT_IKEY structures at this point - the amount of memory the remaining buffers use is small - but they are spread out across a large amount of memory, tying it up and preventing it from being used for 32 kB allocations.
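Simple arithmetic on the WT_INSERT row above makes the point (shown in the shell only for convenience):

mongo --nodb --quiet --eval '
    var liveMiB = 71, spanMiB = 781;   // WT_INSERT row: live bytes vs. span bytes
    print("span utilization: " + (100 * liveMiB / spanMiB).toFixed(1) + "%");  // about 9%
    print("memory stranded:  " + (spanMiB - liveMiB) + " MiB");                // about 710 MiB
'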

Comment by Feng Yu [ 12/Nov/15 ]

Mongodb 3.2 will release soon, is there any new progress in this issue?

Comment by Alexander Gorrod [ 12/Oct/15 ]

For the record, I was hoping that jemalloc would show a different performance profile than tcmalloc for this workload. It did not - the timeseries plot was almost identical.

Comment by Bruce Lucas (Inactive) [ 09/Oct/15 ]

I think it helps some, but it doesn't completely solve the problem:

  • From a memory pressure performance perspective the fact that it eventually releases the memory is good, but to some extent by that point the harm has already been done: the kernel will already have had to evict stuff from the file cache due to the increased memory pressure, even if transient.
  • The amount of time that it's in that state may not be that brief - in my test below with a 10 GB cache it was a couple of minutes, and I would expect it to be longer with the larger caches that are more typical.
  • If my analysis below is correct, it helps with this particular test because it is an extreme test that eventually evicts all of the old collection, allowing the memory to be decommitted. In a less extreme workload we may never evict an entire collection, leaving memory fragmented by the partial contents of that collection.
  • From an OOM perspective it doesn't help - OOM is OOM, no matter how brief.

Here are some stats from a 10 GB run with aggressive decommit enabled:

  • At A we begin reading in the collection from disk, evicting the collection that is filling the cache.
  • From A to B we see a lot of frees and the central free list builds to 7 GB, but that space
    • is not reused for the data being read from disk, I guess because it is dedicated to buffer sizes too small for that purpose?
    • is not decommitted, I assume because the pages in the free list are not yet completely empty.
  • At B, however, this changes: the central free cache begins to drop and unmapped bytes correspondingly rise, presumably because entire pages are now becoming empty and so are decommitted.
  • Coinciding with this starting at B we see that we are evicting internal pages.

Theory: as we allocate internal pages, they end up interspersed among the leaf pages. When we begin evicting, we first evict a lot of leaf pages, and only at B begin to evict the internal pages. This means that from A to B we accumulate a lot of pages that are mostly empty except for some sparse internal pages (last stat shows that of the 10 GB of cache only 44 MB are internal pages), and can neither be decommitted nor reused for the new collection.

If that is correct, some possibly naive thoughts about how it could be fixed:

  • Use a memory allocator that supports separate heaps, and allocate the internal pages from a separate heap. (Unknown how that would interact with separate thread heaps as in tcmalloc...)
  • Use some kind of sub-allocation scheme for internal pages where large buffers are obtained for internal pages, which are then subdivided by WT, in order to keep all the internal pages together.
  • More aggressively evict internal pages.
Comment by Alexander Gorrod [ 09/Oct/15 ]

I've done some testing with this use case, and I believe there is a TCMalloc setting that can help. I reduced the size of the test case to 3GB, and with the current MongoDB, I see similar memory growth to that reported with a 10GB test case.

The issue is that towards the end of the test, memory use spikes well above the configured 3GB cache size - in my testing, to 4.84GB. That additional memory appears in the tcmalloc pageheap_free_bytes statistic. The TCMalloc page heap is the heap used to service memory allocations greater than 32k. There is a TCMalloc configuration option called "aggressive decommit" that causes the page heap not to hold on to freed memory.

I ran the reproducer with the head of MongoDB master branch, and generated the following timeseries:

You can see that towards the end of the run the resident memory bumps up to 4.8GB and stays there.

I did another run where I turned on the aggressive reclaim flag for TCMalloc, and it generated the following timeseries graph:

The resident set size still bumped up at the end of the run, but it quickly returned to the baseline level.

The change to enable the aggressive reclaim flag is simple:

--- a/src/mongo/util/tcmalloc_set_parameter.cpp
+++ b/src/mongo/util/tcmalloc_set_parameter.cpp
@@ -125,6 +125,10 @@ MONGO_INITIALIZER_GENERAL(TcmallocConfigurationDefaults,
     if (getenv("TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES")) {
         return Status::OK();
     }
+    Status status = tcmallocAggressiveMemoryDecommit.setFromString("1");
+    if (!status.isOK()) {
+        return status;
+    }
     return tcmallocMaxTotalThreadCacheBytesParameter.setFromString("0x40000000" /* 1024MB */);
 }

The aggressive reclaim flag is also enabled by default in the newer release of TCMalloc (gperftools 2.4).
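For experimentation without rebuilding, the parameter named in the diff can presumably also be set directly, assuming tcmallocAggressiveMemoryDecommit is registered as a runtime-settable server parameter:

mongod --dbpath /data/db --setParameter tcmallocAggressiveMemoryDecommit=1

mongo --eval 'db.adminCommand({setParameter: 1, tcmallocAggressiveMemoryDecommit: 1})'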

bruce.lucas Do you think that enabling the aggressive release flag would be enough to resolve this issue?

Comment by Alexander Gorrod [ 09/Oct/15 ]

TCMalloc aggressive reclaim on/off

Comment by Bruce Lucas (Inactive) [ 06/Sep/15 ]

Issue also occurs under 3.1.7.

Comment by Bruce Lucas (Inactive) [ 06/Sep/15 ]

Repro script:

db=/ssd/db
gb=10
threads=50
 
function start {
    killall -9 -w mongod
    rm -rf $db $db.log
    mkdir -p $db
    mongod --dbpath $db --logpath $db.log --storageEngine wiredTiger --nojournal \
        --wiredTigerCacheSizeGB $gb --fork
}
 
function monitor {
    mongo >ss.log --eval \
        "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(1000*1)}" &
}
 
# generate a collection
function make {
    cn=$1
    (
        for t in $(seq $threads); do 
            mongo --eval "
                c = db['$cn']
                c.insert({})
                every = 10000
                for (var i=0; c.stats().size < $gb*1000*1000*1000; i++) {
                    var bulk = c.initializeUnorderedBulkOp();
                    for (var j=0; j<every; j++, i++)
                        bulk.insert({})
                    bulk.execute();
                    if ($t==1)
                        print(c.stats(1024*1024).size)
                }
            " &
        done
        wait
    )
}
 
# scan a collection to load it
function load {
    cn=$1
    mongo --eval "
        c = db['$cn']
        print(c.find({x:0}).itcount())
    "
}
 
start      # start mongod
monitor    # monitor serverStatus
make ping  # generate a 10 GB collection
make pong  # generate another 10 GB collection, filling 10 GB cache
sleep 120  # sleep a bit to wait for writes; makes stats clearer
load ping  # scan first 10 GB collection to load it back into cache
