[SERVER-34027] Production issue under high load. Created: 21/Mar/18 Updated: 07/Apr/23 Resolved: 02/Aug/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability |
| Affects Version/s: | 3.6.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Юрий Соколов | Assignee: | David Daly |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | former-quick-wins | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Sprint: | Dev Tools 2019-05-06, Dev Tools 2019-04-22 |
| Participants: | |
| Description |
|
We had an issue in production yesterday:
Environment: AWS i3.16xlarge, mongod data path is placed on an LVM volume over two NVMe devices. |
| Comments |
| Comment by Henrik Ingo (Inactive) [ 16/Jan/20 ] | |
|
Continuing to work on this issue, I have also started to focus on the central cache. (This holds free memory objects between the pageheap and the per-thread caches.) I just want to note for the record that central_cache_free_bytes here is a steady 35 GB. It is not affected by the stall / decommits from the pageheap to the OS. In my own tests, under heavy disk-bound load the central cache is sometimes as high as 100 GB. | |
| Comment by Henrik Ingo (Inactive) [ 12/Dec/19 ] | |
|
I've been working on trying to reproduce this issue. Based on analysing the diagnostic data file you provided, I created a sysbench workload, tcmalloc_repro.lua. The commentary at the top of the file summarizes the characteristics I deduced from this ticket. Running the data load for 3 days and then the actual workload for 3 hours, I was able to reproduce a situation where tcmalloc pageheap_free_memory reaches 62-68 GB, as in this ticket. However, I was NOT able to reproduce an actual stall. The difference between the diagnostic data analyzed in this thread and my repro is that in my repro the pageheap decommit count remains small (about 10, versus 3 million above). This means that when tcmalloc finally does decommit memory, it is able to decommit more bytes at a time, and therefore free memory becomes available quickly enough that mongod doesn't get completely stuck. In this ticket the average bytes per decommit is 23 KB; in my repro it was 54 MB. In related news, SERVER-33296 is now done, which allows adjusting the tcmalloc release rate via setParameter.
This has been backported to all stable branches down to 3.6 and will become available in the next stable release of each. From what we have seen so far, we believe that dialing tcmallocReleaseRate up to a value greater than 1 will prevent pageheap_free_memory from growing indefinitely, which in turn should also avoid stalls such as the one seen in this ticket. I will be able to verify the first part of that with tcmalloc_repro.lua. | |
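As an illustration only (not part of the original comment): on a build that includes SERVER-33296, the release rate can be raised at runtime via setParameter. The value 5.0 below is an arbitrary example, not a recommendation from this ticket.

```python
# Sketch: raise tcmalloc's release rate at runtime via setParameter.
# Assumes a build with SERVER-33296; 5.0 is only an example value --
# anything above the default of 1 makes tcmalloc return freed spans
# to the OS more aggressively.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # adjust URI as needed
result = client.admin.command({"setParameter": 1, "tcmallocReleaseRate": 5.0})
print(result)  # response typically echoes the previous value under 'was'
```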
| Comment by Githook User [ 09/Dec/19 ] | |
|
Author: {'email': 'henrik.ingo@mongodb.com', 'name': 'Henrik Ingo', 'username': 'henrikingo'}
Message: PERF-1898 Add test_control.yml for | |
| Comment by Henrik Ingo (Inactive) [ 09/Aug/19 ] | |
|
We recently learned that TCMALLOC_RELEASE_RATE can be used to make tcmalloc release memory more frequently to the OS. We haven't tested anything that would resemble a repro of this issue, but I could speculate that using TCMALLOC_RELEASE_RATE=10 could cause tcmalloc to release memory more frequently and in smaller chunks.
| |
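A minimal sketch of the idea mentioned above (my own illustration, not from the comment): starting mongod with the gperftools TCMALLOC_RELEASE_RATE environment variable set. The paths and options are placeholders.

```python
# Sketch: launch mongod with TCMALLOC_RELEASE_RATE=10 so tcmalloc returns
# free memory to the OS more eagerly. Paths and options are placeholders.
import os
import subprocess

env = dict(os.environ, TCMALLOC_RELEASE_RATE="10")
subprocess.Popen(
    ["mongod", "--dbpath", "/data/db", "--logpath", "/data/db/mongod.log"],
    env=env,
)
```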
| Comment by Andrew Morrow (Inactive) [ 13/May/19 ] | |
|
david.daly - Handing over to you per e-mail discussion. | |
| Comment by Mark Benvenuto [ 08/Aug/18 ] | |
|
The most interesting sample is the one collected from 2018-03-20T16:42:44Z to 2018-03-20T16:43:06.122Z. This means it took approximately 24 seconds to collect one FTDC sample, when it normally takes < 10 ms. Accounting for where the time was spent is hard. The serverStatus command usually logs which sub-section takes the longest, but that log line is probably long gone. What is clear is the following: 1. The change in decommits between this sample and the previous one is 3,010,448 (3 million). This means it was the TCMalloc page heap GC that stalled the system, since calling munmap 3 million times while holding a spin lock is slow. There are a few possible fixes:
Upstream has not made any changes to this code as of the current date. We could pursue filing a bug after creating a smaller repro, and solicit their opinion. | |
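For illustration (not from the original comment), something like the following could estimate the decommit delta and average bytes per decommit between two serverStatus samples. The tcmalloc field names and the tcmalloc=2 verbosity option are assumptions based on the metrics discussed in this thread and may differ across versions.

```python
# Sketch: estimate how many pageheap decommits happened between two
# serverStatus samples and the average bytes returned per decommit.
# Field names below are assumptions based on the metrics discussed above.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

def tcmalloc_stats():
    # tcmalloc=2 requests the more detailed tcmalloc section, if supported.
    status = client.admin.command("serverStatus", tcmalloc=2)
    return status["tcmalloc"]["tcmalloc"]

before = tcmalloc_stats()
time.sleep(60)  # wait between the two samples; the interval is arbitrary
after = tcmalloc_stats()

decommits = after["pageheap_decommit_count"] - before["pageheap_decommit_count"]
decommit_bytes = (
    after["pageheap_total_decommit_bytes"] - before["pageheap_total_decommit_bytes"]
)
if decommits:
    print(f"{decommits} decommits, {decommit_bytes / decommits / 1024:.1f} KiB each")
```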
| Comment by Ian Whalen (Inactive) [ 05/Aug/18 ] | |
|
ping mark.benvenuto | |
| Comment by Bruce Lucas (Inactive) [ 21/Mar/18 ] | |
|
Hi Yura, Thank you for the very detailed problem report! Here's what we see in the diagnostic data:
It looks like the stall was caused by tcmalloc decommitting (returning to the OS) its entire pageheap of free memory, consisting of about 66 GB in about 3 million spans. Based on that, it appears this is the same issue as SERVER-31417. Thanks, |
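As a quick back-of-the-envelope check (my own addition, using only the figures quoted above), 66 GB spread over roughly 3 million spans works out to about the same ~23 KB-per-decommit average mentioned earlier in the thread:

```python
# Rough arithmetic check on the figures quoted above (66 GB across ~3 M spans).
pageheap_bytes = 66 * 1024**3   # ~66 GiB decommitted
spans = 3_000_000               # ~3 million spans, i.e. ~3 million munmap calls
print(f"~{pageheap_bytes / spans / 1024:.0f} KiB per span")  # prints ~23 KiB
```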