[SERVER-34027] Production issue under high load. Created: 21/Mar/18  Updated: 07/Apr/23  Resolved: 02/Aug/20

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 3.6.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Юрий Соколов Assignee: David Daly
Resolution: Duplicate Votes: 1
Labels: former-quick-wins
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File MongoStat-2018-03-21-long.png     PNG File MongoStat-2018-03-21-short.png     PNG File SystemStat-2018-03-21-long.png     PNG File SystemStat-2018-03-21-short.png     PNG File decommit.png     File diagnostic.tar.gz     PNG File henrik-include-centralcache.png    
Issue Links:
Duplicate
duplicates SERVER-31417 Improve tcmalloc when decommitting la... Backlog
is duplicated by SERVER-31417 Improve tcmalloc when decommitting la... Backlog
Operating System: ALL
Sprint: Dev Tools 2019-05-06, Dev Tools 2019-04-22
Participants:

 Description   

We had an issue in production yesterday:

  • At ~16:31 UTC, undetected queries started intensively reading cold data. Reads from disk increased greatly, as did eviction from mongo's page cache. This unexpected load lasted until ~17:12.
    The load by itself is not remarkable enough to report here, but it led to the following bad behaviour:
  • At 16:42:44 mongod literally froze. It did not respond to anything, did not write anything to the log, and did not send statistics until 16:43:06.
    I believe this is a bug.
    I'm attaching the diagnostic data file that covers that period, along with monitoring screenshots for the long period (16:20-17:30) and focused on the issue (short, 16:40-16:45).

Environment: AWS i3.16xlarge; the mongod data path is placed on an LVM volume spanning two NVMe devices.



 Comments   
Comment by Henrik Ingo (Inactive) [ 16/Jan/20 ]

Continuing to work on this issue, I have also started to focus on the central cache. (This holds free memory objects between the pageheap and the per-thread caches.)

I just wanted to note for the record that central_cache_free_bytes here is a steady 35 GB. It is not affected by the stall / decommits from the pageheap to the OS.

In my own tests, under heavy disk-bound load the central cache is sometimes as high as 100 GB.
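
For reference, a minimal mongo shell sketch of how these figures can be sampled on a live node. The field names are assumed to match the tcmalloc section of serverStatus and may differ by version:

// Sketch: sample central cache and pageheap free bytes (assumed field names).
var t = db.serverStatus().tcmalloc.tcmalloc;
print("central_cache_free_bytes: " + t.central_cache_free_bytes);
print("pageheap_free_bytes: " + t.pageheap_free_bytes);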

Comment by Henrik Ingo (Inactive) [ 12/Dec/19 ]

I've been working on trying to reproduce this issue. Based on analysing the diagnostic data file you provided, I created a sysbench workload tcmalloc_repro.lua. The commentary at the top of the file summarizes the characteristics I deduced from this ticket.

Running the data load first for 3 days and then the actual workload for 3 hours, I was able to reproduce a situation where tcmalloc pageheap_free_memory reaches 62-68 GB, like in this ticket. However, I was NOT able to reproduce an actual stall.

The difference between the diagnostic data analyzed in this thread and my repro is that in my repro the pageheap decommit count remains small (about 10, vs. 3 million above). This means that when tcmalloc finally does decommit memory, it is able to decommit more bytes at a time, and therefore free memory becomes available quickly enough that mongod doesn't get completely stuck. In this ticket the average bytes per decommit is 23 KB; in my repro it was 54 MB.
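
For anyone repeating that comparison, here is a rough mongo shell sketch of the bytes-per-decommit calculation. The pageheap_decommit_count and pageheap_total_decommit_bytes field names are assumptions based on the FTDC metrics referenced above; adjust them to whatever your version exposes:

// Sketch: estimate decommits and average bytes per decommit between two samples.
function tcstats() { return db.serverStatus().tcmalloc.tcmalloc; }
var a = tcstats();
sleep(60 * 1000); // one minute between samples
var b = tcstats();
var decommits = b.pageheap_decommit_count - a.pageheap_decommit_count;
var bytes = b.pageheap_total_decommit_bytes - a.pageheap_total_decommit_bytes;
print("decommits: " + decommits + ", avg bytes per decommit: " + (decommits > 0 ? bytes / decommits : 0));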

In related news, SERVER-33296 is now done. This allows adjusting the tcmalloc release rate via setParameter.

db.adminCommand({setParameter: 1, tcmallocReleaseRate: 10})

This is backported to all stable branches down to 3.6, and will become available in the next stable releases of each.

From what we have seen so far, we believe that dialing tcmallocReleaseRate up to a value bigger than 1 will prevent pageheap_free_memory from growing indefinitely, and that in turn would also avoid stalls such as the one seen in this ticket. I will be able to verify the first part of that with tcmalloc_repro.lua.
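
As a sketch of that verification (not the actual tcmalloc_repro.lua test), one could raise the release rate and then watch pageheap_free_bytes over time from the mongo shell. The pageheap_free_bytes field name is an assumption; adjust to what your version reports:

// Sketch: raise the release rate, then sample pageheap_free_bytes periodically.
db.adminCommand({setParameter: 1, tcmallocReleaseRate: 10});
for (var i = 0; i < 10; i++) {
    print(new Date().toISOString() + " pageheap_free_bytes=" + db.serverStatus().tcmalloc.tcmalloc.pageheap_free_bytes);
    sleep(60 * 1000); // sample once per minute
}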

Comment by Githook User [ 09/Dec/19 ]

Author:

{'email': 'henrik.ingo@mongodb.com', 'name': 'Henrik Ingo', 'username': 'henrikingo'}

Message: PERF-1898 Add test_control.yml for SERVER-34027 repro (#502)
Branch: master
https://github.com/10gen/dsi/commit/414e539b42e3a66684548b6a98024fd7c3bf2e91

Comment by Henrik Ingo (Inactive) [ 09/Aug/19 ]

We recently learned that TCMALLOC_RELEASE_RATE can be used to make tcmalloc release memory more frequently to the OS. We haven't tested anything that would resemble a repro of this issue, but I could speculate that using TCMALLOC_RELEASE_RATE=10 could cause tcmalloc to release memory more frequently and in smaller chunks.

SERVER-42697 will add tcmalloc_release_rate tuning via setParameter.
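
Once that lands, the current value should be checkable with getParameter. This is hedged: it assumes the parameter ends up named tcmallocReleaseRate, as in the later comments on this ticket:

db.adminCommand({getParameter: 1, tcmallocReleaseRate: 1})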

Comment by Andrew Morrow (Inactive) [ 13/May/19 ]

david.daly - Handing over to you per e-mail discussion.

Comment by Mark Benvenuto [ 08/Aug/18 ]

The most interesting sample is the one that was collected from 2018-03-20T16:42:44Z to 2018-03-20T16:43:06.122Z. This means it took approximately 22 seconds to collect one FTDC sample, when it normally takes < 10 ms.

Accounting for where the time is spent is hard. The serverStatus command usually logs which sub-section takes the longest, but that log entry is probably long gone. What is clear is the following:

1. The change in decommits between this sample and the previous sample is 3010448 (~3 million).
2. To report the number of decommits or any other TCMalloc stats requires taking the page heap lock (https://github.com/mongodb/mongo/blob/210bb5d91cb3c77bb3ed169114f8b85cd1062fb3/src/third_party/gperftools-2.5/src/tcmalloc.cc#L642-L778)
3. We know that allocating pages takes this same lock (https://github.com/mongodb/mongo/blob/210bb5d91cb3c77bb3ed169114f8b85cd1062fb3/src/third_party/gperftools-2.5/src/tcmalloc.cc#L1229-L1258), and in order to limit the amount of memory fragmentation, TCMalloc may garbage collect memory (https://github.com/mongodb/mongo/blob/210bb5d91cb3c77bb3ed169114f8b85cd1062fb3/src/third_party/gperftools-2.5/src/page_heap.cc#L124-L154)
4. We did 71 scavenges during this same period due to a central free list bucket growing too large.

This means that it was the TCMalloc page heap GC that stalled the system, since calling munmap 3 million times while holding a spin lock is slow.
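
A quick way to eyeball this state on a live node is the formatted TCMalloc report exposed by serverStatus, which includes the page heap free list and central free list totals. This is a sketch under assumptions: the formattedString field and the verbosity argument may not be present or may differ by version:

// Sketch: dump TCMalloc's human-readable stats block (field name and verbosity are assumptions).
print(db.serverStatus({tcmalloc: 2}).tcmalloc.tcmalloc.formattedString);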

There are a few possible fixes:

  1. Introduce a new state into the free list location enum (https://github.com/mongodb/mongo/blob/210bb5d91cb3c77bb3ed169114f8b85cd1062fb3/src/third_party/gperftools-2.5/src/span.h#L64-L66) to support a garbage collection that can be deferred to a separate thread, which would free pages in the background in smaller increments.
  2. Temporarily turn on aggressive decommit and then disable it after a period of time, instead of flushing all memory at once. This may not help the fragmentation of the page heap, but it would prevent the massive stall.
  3. Turn off the fragmentation flush, or make it smaller.

Upstream has not made any changes to this code as of the current date. We could pursue filing a bug after creating a smaller repro, and solicit their opinion.

Comment by Ian Whalen (Inactive) [ 05/Aug/18 ]

ping mark.benvenuto

Comment by Bruce Lucas (Inactive) [ 21/Mar/18 ]

Hi Yura,

Thank you for the very detailed problem report! Here's what we see in the diagnostic data:

It looks like this was caused by tcmalloc decommitting (returning to the OS) its entire pageheap of free memory, consisting of about 66 GB in about 3 million spans. Based on that, it appears that this is the same issue as SERVER-31417.
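(As a rough cross-check of those numbers: 66 GB spread over ~3 million spans works out to roughly 23 KB per span, which matches the ~23 KB average bytes per decommit reported elsewhere in this ticket.)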

Thanks,
Bruce
