[SERVER-20306] 75% excess memory usage under WiredTiger during stress test
| Created: | 06/Sep/15 | Updated: | 16/Nov/21 | Resolved: | 30/Sep/16 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance, WiredTiger |
| Affects Version/s: | 3.0.6, 3.1.7, 3.2.5, 3.3.5 |
| Fix Version/s: | 3.2.10, 3.3.11 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Michael Cahill (Inactive) |
| Resolution: | Done | Votes: | 21 |
| Labels: | WTplaybook, code-only |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Participants: | |
| Description |
Issue Status as of Sep 30, 2016

ISSUE SUMMARY
That changes in

USER IMPACT

WORKAROUNDS

AFFECTED VERSIONS

FIX VERSION

This ticket is a spin-off from

Under certain workloads a large amount of memory in excess of allocated memory is used. This appears to be due to fragmentation, or some related memory-allocation inefficiency. The repro consists of:
Memory stats over the course of the run:
Theories:
| Comments |
| Comment by Rakhi Maheshwari [ 09/Oct/18 ] |
MongoDB memory usage is very high (75%). On Mongodb01, Mongodb02, and Mongodb03, available memory dropped to 48%, 29%, and 31% from 67%, 46%, and 50% respectively over 15 days of load testing. Is there a chance that memory usage reaches 90% and the process gets killed? Why is it not returning free space to the OS? (See attached rsStatus.txt.)
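High resident memory that is not handed back to the OS is often memory that tcmalloc has freed but retains for reuse, rather than a leak. A minimal mongo-shell sketch of how to check this, assuming the node runs with the bundled tcmalloc so serverStatus exposes a tcmalloc section (exact field availability varies by version):

```javascript
// Compare resident memory with what tcmalloc says is allocated vs. retained.
var ss = db.serverStatus();
var MB = 1024 * 1024;
print("mem.resident (MB):        " + ss.mem.resident);
print("allocated by mongod (MB): " + Math.round(ss.tcmalloc.generic.current_allocated_bytes / MB));
// Freed memory tcmalloc keeps for reuse instead of returning it to the OS:
print("page heap free (MB):      " + Math.round(ss.tcmalloc.tcmalloc.pageheap_free_bytes / MB));
// Memory already released back to the OS (not part of RSS):
print("page heap unmapped (MB):  " + Math.round(ss.tcmalloc.tcmalloc.pageheap_unmapped_bytes / MB));
```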
| Comment by Michael Cahill (Inactive) [ 30/Sep/16 ] |
Given the changes in
Here are the results of running the attached repro-32-insert.sh script against 3.2.6 and 3.2.10-rc2. Each run varied the $gb variable, which determines both the WiredTiger cache size and the volume of data inserted. For each run, I report the maximum value of db.serverStatus().mem.resident seen during the run. Here is a graph of peak memory use for 3.2.6 and 3.2.10-rc2:
As you can see, where there used to be 60+% RAM use over the configured cache size, with 3.2.10 the maximum RAM use tracks the cache size to within a few percent for larger cache sizes.
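As a rough illustration of how that peak figure can be captured, the following mongo-shell sketch polls serverStatus from a second shell and tracks the maximum; the loop and interval are illustrative and not taken from repro-32-insert.sh:

```javascript
// Track the peak resident set size (MB) over the course of a run.
var peakResidentMB = 0;
var samples = 600;                                   // e.g. 10 minutes at 1s intervals
for (var i = 0; i < samples; i++) {
    var resident = db.serverStatus().mem.resident;   // RSS in MB
    if (resident > peakResidentMB) {
        peakResidentMB = resident;
    }
    sleep(1000);                                     // shell sleep, in milliseconds
}
print("peak mem.resident (MB): " + peakResidentMB);
```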
| Comment by Michael Cahill (Inactive) [ 08/Jun/16 ] |
mdcallag, I've attached a graph generated from your diagnostic data:
What this graph shows is that WiredTiger cache use varies between 100-125GB, but the pattern of allocation and freeing causes large amounts of freed memory to accumulate in tcmalloc. In particular, the central freelist grows to 73GB, which accounts for the heap size being much larger than the WiredTiger cache. In our testing, stock jemalloc shows similar overall behavior in the face of this allocation pattern when used through the standard malloc/free interface. We think we could improve the situation with jemalloc by using multiple arenas: the first step towards that is in
We do not have a solution today - the best workaround we have is to use a smaller cache size. I am working in
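The counters behind this graph are also visible from the mongo shell on a tcmalloc build; a small illustrative snippet comparing the WiredTiger cache contents with the memory tcmalloc is holding in its central free list (stat names assume a tcmalloc-enabled serverStatus on a WiredTiger node):

```javascript
// WiredTiger cache contents vs. memory held in tcmalloc's free lists.
var ss = db.serverStatus();
var GB = 1024 * 1024 * 1024;
var wtCacheBytes = ss.wiredTiger.cache["bytes currently in the cache"];
var heapSize     = ss.tcmalloc.generic.heap_size;
var centralFree  = ss.tcmalloc.tcmalloc.central_cache_free_bytes;

print("WT cache in use (GB):    " + (wtCacheBytes / GB).toFixed(1));
print("tcmalloc heap size (GB): " + (heapSize / GB).toFixed(1));
print("central free list (GB):  " + (centralFree / GB).toFixed(1));
```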
| Comment by Mark Callaghan [ 08/Jun/16 ] |
es = db.serverStatus()
These are at test end, using tcmalloc.
| Comment by Mark Callaghan [ 08/Jun/16 ] |
Repeated a test using tcmalloc instead of jemalloc. The WT block cache is 128G. With the old jemalloc install, VSZ/RSS was 236G/208G. With the bundled tcmalloc, VSZ/RSS was 192G/191G. I will upload the output from db.serverStatus() and the metrics file from diagnostic.data.
| Comment by Mark Callaghan [ 06/Jun/16 ] |
Will repeat a test this week. Previous tests used jemalloc.
| Comment by Martin Bligh [ 06/Jun/16 ] |
mdcallag When it's overallocated, can you grab db.serverStatus().tcmalloc? The virtual size being over the configured cache size is fine, but RSS is not.
| Comment by Mark Callaghan [ 06/Jun/16 ] |
I tried both of these and they reduced the max RSS to not much larger than the WT block cache. That is good. The problems were:
"eviction_dirty_target=20,eviction_dirty_trigger=25"
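For context, settings like these are WiredTiger engine-config options; one hedged way to apply them without a restart is the wiredTigerEngineRuntimeConfig server parameter, if the running version supports it (the change is not persisted across restarts):

```javascript
// Apply WiredTiger eviction tuning at runtime (assumes the
// wiredTigerEngineRuntimeConfig setParameter is available in this version).
db.adminCommand({
    setParameter: 1,
    wiredTigerEngineRuntimeConfig: "eviction_dirty_target=20,eviction_dirty_trigger=25"
});
```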
| Comment by Mark Callaghan [ 02/Jun/16 ] |
I have been doing many tests with MongoDB 3.3.5. With the WT block cache set to 128G, mongod is at ~300G VSZ and ~242G RSS. My workload is an insert-only test followed by inserts + queries via the insert benchmark. This isn't a problem for the RocksDB or mmapv1 engines. I use jemalloc instead of tcmalloc. Let me know if you want more details. I think this problem is made worse by
My benchmark client is at https://github.com/mdcallag/mytools/tree/master/bench/ibench which uses an older version of the MongoDB Python client. I use this test command line for the following script; np.sh is in the github repo linked above:
| Comment by Bruce Lucas (Inactive) [ 06/May/16 ] |
Some metrics showing behavior on 3.2 and 3.3:
version 3.2.6
version 3.3.5
| Comment by Martin Bligh [ 28/Apr/16 ] |
bruce.lucas Thanks for looking at this. I'll look in more detail tomorrow, but my initial suspicion is that we will still need the WT "gross accounting" fix for the pessimal scenario.
| Comment by Bruce Lucas (Inactive) [ 27/Apr/16 ] |
martin.bligh, fyi - as per above, I'm not seeing an improvement in 3.3.5 on this test.
| Comment by Bruce Lucas (Inactive) [ 27/Apr/16 ] |
The original repro script attached to this ticket does not reproduce the problem under 3.2 because changes to WT cache management prevent it from filling the cache. I've attached a modified version as repro-32.sh that does reproduce the problem.
Here's an overview of a run on 3.2.5:
This produces fragmentation, measured by (heap_size - allocated) / allocated, of about 50%. Here we zoom in on the point where the fragmentation occurs, comparing a run on 3.2.5 with one on 3.3.5:
version 3.2.5
version 3.3.5
Detailed statistics at D, where heap size reaches its maximum. This is the fairest point at which to measure fragmentation because it is all due to heap expansion, which indicates an inability to use existing free memory:
Most of the fragmentation is accounted for by about 2 GB of free memory in the central cache; new statistics collected in 3.3.5 show that most of this is accounted for by about 1.8 GB of free 32-byte buffers, interspersed among about 2.4 GB of live 32-byte buffers.
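The fragmentation measure used here can be computed directly from serverStatus on a tcmalloc build; a short illustrative snippet:

```javascript
// Fragmentation as defined above: (heap_size - allocated) / allocated.
var gen = db.serverStatus().tcmalloc.generic;
var fragmentation = (gen.heap_size - gen.current_allocated_bytes) / gen.current_allocated_bytes;
print("fragmentation: " + (fragmentation * 100).toFixed(1) + "%");
```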
| Comment by Ramon Fernandez Marina [ 08/Mar/16 ] |
This issue is still being investigated and there's no fix for it at the moment. It has been scheduled for the current development cycle. Please continue to watch the ticket for updates.
| Comment by Tim Hawkins [ 08/Mar/16 ] |
Has anything been done to fix the fragmentation issue? If not, it should be.
| Comment by Johnny Shields [ 08/Mar/16 ] |
Can this issue be closed?
| Comment by Bruce Lucas (Inactive) [ 04/Mar/16 ] |
A different use case that also generates fragmentation is separately tracked by
| Comment by Tim Hawkins [ 27/Jan/16 ] |
The upgrade was successful: we saw a considerable improvement in both performance and memory stability, though only time will tell. We are going to start transferring more load to our canary cluster and are planning a production upgrade.
| Comment by Alexander Gorrod [ 20/Jan/16 ] |
That sounds like a good plan - let us know if you encounter any further issues. We will update this ticket once we have a fix for the excessive memory usage issue.
| Comment by Tim Hawkins [ 19/Jan/16 ] |
@Alexander, thanks for the prompt response; your answer confirmed our own suspicion that an upgrade to 3.0.8 would help. We will upgrade our "canary" system first: it's identical to our production system and has a few low-revenue websites connected to it, so it will let us determine whether there is anything in our application that does not get on well with 3.0.8. We had been observing the same issue on both production and canary.
| Comment by Alexander Gorrod [ 19/Jan/16 ] |
thawkins Sorry you have run into a problem with memory consumption and MongoDB. The first thing I recommend trying is to upgrade to the latest version of MongoDB - there have been several improvements in memory management since 3.0.6, including
In terms of a workaround, the best solution I can give you is to configure a smaller WiredTiger cache size using the configuration option as defined here:
Any additional memory on the machine is likely to still be useful: since the WiredTiger cache stores uncompressed pages, additional available memory is generally used by the file system cache to keep compressed data cached.
| Comment by Tim Hawkins [ 19/Jan/16 ] |
Is there a workaround for this? We think we may have hit this on a high-load production cluster running 3.0.6/WT. We are seeing memory usage exceeding 90% and getting multiple timeouts on queries.
| Comment by Alexander Gorrod [ 12/Jan/16 ] |
bruce.lucas Allocating sets of structures may help in some situations, but it potentially introduces other problems:
| Comment by Bruce Lucas (Inactive) [ 12/Jan/16 ] |
Would need to check this, but one thing that may be exacerbating the issue is that, I believe, when inserting into a collection there will be writes to at least two tables - the collection and its index(es) - interspersed in memory, so that tcmalloc spans will be tied up by small data structures until the corresponding WT pages for both tables are freed. Would it be possible to allocate WT_INSERT and WT_UPDATE structures in per-page, or at least per-table, batches to avoid this interspersing, so that tcmalloc spans tend to contain only WT_INSERT and WT_UPDATE structures for a single WT page? That might allow tcmalloc spans to be freed for reuse (for the required larger memory allocations) sooner, when the WT_INSERT and/or WT_UPDATE structures for a given WT page that occupy a particular span are freed.
| Comment by Alexander Gorrod [ 11/Jan/16 ] |
bruce.lucas Thanks for the insightful information. It would be interesting to get your diagnostic code into the WiredTiger tree somehow. Your analysis looks correct to me. I can give some context on the different structure types you reference:
In terms of helping to reduce fragmentation of allocations when a workload switches from insert/update to read, I think we can improve the current situation, though some fragmentation will be inevitable. The following table lists things I think could help:
| Comment by Bruce Lucas (Inactive) [ 11/Jan/16 ] |
I instrumented the code to record a string tag with each allocated block to identify its origin (specifically filename and line number), and wrote some code to scan the heap to collect statistics about allocated blocks. Running the repro described above, where we create a tree in memory and then replace it with a tree read from disk, at the peak fragmentation just before the old tree is completely evicted, selected stats show the worst offenders:
In other words, there are 2815 MB in the central free list, mostly in tiny objects in class 4 (48 bytes) and class 5 (64 bytes). This is because as we read in the new tree from disk and evict the old tree currently in memory, we need to allocate 32 kB buffers, but we can't reuse the memory that held the tiny buffers until essentially all of them are freed, due to fragmentation of that memory by the remaining small buffers. The tagged allocation info gives us filename and line number, and for each allocation site tells us how many bytes of that allocation are currently active, and also how sparsely allocated they are, that is, how many bytes of spans those allocated blocks are spread out among. Converting filename and line number to the name of the data structure and reformatting the last four lines above a bit:
In other words, we have freed almost all the WT_INSERT, WT_UPDATE, WT_REF, and WT_IKEY structures at this point - the amount of memory those buffers use is small - but they are spread out among a large amount of memory, tying it up and preventing it from being used for 32 kB allocations.
| Comment by Feng Yu [ 12/Nov/15 ] |
MongoDB 3.2 will be released soon - is there any new progress on this issue?
| Comment by Alexander Gorrod [ 12/Oct/15 ] |
For the record, I was hoping that jemalloc would show a different performance profile than tcmalloc for this workload. It did not - the timeseries plot was almost identical.
| Comment by Bruce Lucas (Inactive) [ 09/Oct/15 ] |
I think it helps some, but it doesn't completely solve the problem:
Here are some stats from a 10 GB run with aggressive decommit enabled:
Theory: as we allocate internal pages, they end up interspersed among the leaf pages. When we begin evicting, we first evict a lot of leaf pages, and only at B begin to evict the internal pages. This means that from A to B we accumulate a lot of pages that are mostly empty except for some sparse internal pages (the last stat shows that of the 10 GB of cache only 44 MB is internal pages), and these pages can neither be decommitted nor reused for the new collection. If that is correct, some possibly naive thoughts about how it could be fixed:
| Comment by Alexander Gorrod [ 09/Oct/15 ] |
I've done some testing with this use case, and I believe there is a TCMalloc setting that can help. I reduced the size of the test case to 3GB, and with current MongoDB I see memory growth similar to that reported with the 10GB test case. The issue is that towards the end of the test, memory use spikes well above the configured 3GB cache size - in my testing, to 4.84GB. That additional memory appears in the tcmalloc pageheap_free_bytes statistic. The TCMalloc page heap is the heap used to service memory allocations greater than 32k. There is a configuration option to TCMalloc called "aggressive decommit" that causes the page heap not to hold on to memory. I ran the reproducer with the head of the MongoDB master branch, and generated the following timeseries:
You can see that towards the end of the run the resident memory bumps up to 4.8GB and stays there. I did another run with the aggressive reclaim flag for TCMalloc turned on, and it generated the following timeseries graph:
The resident set size still bumped up at the end of the run, but it quickly returned to the baseline level. The change to enable the aggressive reclaim flag is simple:
The aggressive reclaim flag is also enabled by default in the newer release of TCMalloc (gperftools 2.4). bruce.lucas Do you think that enabling the aggressive release flag would be enough to resolve this issue?
| Comment by Alexander Gorrod [ 09/Oct/15 ] |
TCMalloc aggressive reclaim on/off
| Comment by Bruce Lucas (Inactive) [ 06/Sep/15 ] |
Issue also occurs under 3.1.7.
| Comment by Bruce Lucas (Inactive) [ 06/Sep/15 ] |
Repro script:
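The script itself is attached to the ticket; as a rough mongo-shell sketch of the kind of workload described in the comments above (build one collection's tree in cache, then read a second collection back from disk so the first tree is evicted) - collection names, document counts, and batching below are illustrative, not the original script:

```javascript
// Illustrative sketch only (not the attached repro-*.sh): exercise the
// pattern of building one btree in cache, then replacing it with another
// read from disk. Assumes a 3.2+ shell (insertMany) and a small test cache.
function load(coll, ndocs) {
    var pad = new Array(1001).join("x");          // ~1KB filler per document
    var batch = [];
    for (var i = 0; i < ndocs; i++) {
        batch.push({_id: i, pad: pad});
        if (batch.length === 1000) { coll.insertMany(batch); batch = []; }
    }
    if (batch.length > 0) { coll.insertMany(batch); }
}

var ndocs = 2 * 1000 * 1000;   // size so each collection overflows the cache

load(db.a, ndocs);             // collection "a" ends up mostly on disk...
load(db.b, ndocs);             // ...once "b"'s inserts fill the cache

// Scanning "a" reads its tree back from disk and forces eviction of "b";
// watch db.serverStatus().mem.resident versus the configured cache size here.
db.a.find().itcount();
```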