[SERVER-31417] Improve tcmalloc when decommitting large amounts of memory Created: 05/Oct/17  Updated: 30/Jan/24

Status: Backlog
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Backlog - Performance Team
Resolution: Unresolved Votes: 43
Labels: RF36, former-quick-wins, perf-effort-xlarge, perf-improve-product, perf-urgency-asap, perf-value-essential
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
duplicates SERVER-34027 Production issue under high load. Closed
is duplicated by SERVER-34027 Production issue under high load. Closed
is duplicated by SERVER-37541 MongoDB Not Returning Free Space to OS Closed
Problem/Incident
Related
related to SERVER-31380 Add metrics related to tcmalloc acqui... Closed
is related to SERVER-33296 Excessive memory usage due to heap fr... Backlog
Assigned Teams:
Product Performance
Sprint: Dev Tools 2019-05-06, Dev Tools 2019-04-22
Participants:
Case:

 Description   

tcmalloc may occasionally release large amounts of pageheap free memory to the kernel by calling madvise. This can take seconds when the amount of memory involved is many GB. A tcmalloc internal lock is held while this happens, so this can potentially stall many threads, causing widespread latency spikes.

There is no direct metric that diagnoses this (SERVER-31380 would provide that), but it can be inferred as the likely cause when the following occur together:

  • tcmalloc pageheap free memory decreases to near zero
  • tcmalloc unmapped memory increases by a corresponding amount
  • resident memory decreases by the same amount
  • system free memory increases by that amount
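The indicators above can be checked programmatically from two successive serverStatus-style samples. A minimal sketch (field names follow the serverStatus "tcmalloc" section; the thresholds and tolerance are illustrative, not official guidance):

```python
# Sketch: infer a likely large pageheap decommit from two serverStatus-style
# samples, using the signature described above: pageheap free memory drops
# while unmapped memory rises by a corresponding amount.

GB = 1 << 30

def likely_decommit(before, after, min_bytes=1 * GB, tolerance=0.25):
    """Return True if pageheap free memory fell by at least min_bytes
    while unmapped memory rose by a corresponding amount."""
    freed = before["pageheap_free_bytes"] - after["pageheap_free_bytes"]
    unmapped = after["pageheap_unmapped_bytes"] - before["pageheap_unmapped_bytes"]
    if freed < min_bytes or unmapped <= 0:
        return False
    # "Corresponding amount": the two deltas agree within the tolerance.
    return abs(freed - unmapped) <= tolerance * freed

# Illustrative samples: ~6 GB moved from pageheap free to unmapped.
before = {"pageheap_free_bytes": 6 * GB, "pageheap_unmapped_bytes": 1 * GB}
after  = {"pageheap_free_bytes": 0,      "pageheap_unmapped_bytes": 7 * GB}
print(likely_decommit(before, after))  # True
```

In practice the same check would also confirm that resident memory fell and system free memory rose by the same amount, per the last two bullets.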


 Comments   
Comment by Ian Springer [ 27/Oct/23 ]

We are also hitting this issue in production and are interested in whether there has been any progress.

We tried increasing the release rate to 5 and 10, and found that it prevented the stalls but also significantly impacted query latency. Setting it to 2 appeared to be a good compromise, but we haven't had a chance to test it extensively yet.

Comment by Jose Ledesma [ 21/Nov/22 ]

We are hitting this issue (server stalls while releasing a large amount of page_heap_free_bytes). Any progress on this issue? Do we know if increasing TCMALLOC_RELEASE_RATE might help mitigate it?

Comment by Henrik Ingo (Inactive) [ 09/Aug/19 ]

We recently learned that TCMALLOC_RELEASE_RATE can be used to make tcmalloc release memory more frequently to the OS. We haven't tested anything that would resemble a repro of this issue, but I could speculate that using TCMALLOC_RELEASE_RATE=10 could cause tcmalloc to call madvise more frequently and in smaller chunks.

SERVER-42697 will add tcmalloc_release_rate tuning via setParameter.
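The intuition behind the release rate can be illustrated with a toy model (this is not tcmalloc's actual algorithm; the threshold constant and event sizes are made up): a higher rate releases memory in smaller, more frequent chunks, while a low rate batches it into rare, large, and therefore stall-prone madvise calls.

```python
# Toy model: a higher release rate lowers the batching threshold, so freed
# pages are returned to the OS in smaller, more frequent releases instead
# of one large, stall-prone release. Purely illustrative numbers.

def simulate(free_events, release_rate, batch_threshold_pages=1024):
    """Accumulate freed pages and 'release' them to the OS whenever the
    pending amount reaches batch_threshold_pages / release_rate.
    Returns the list of release sizes (in pages)."""
    releases = []
    pending = 0
    threshold = batch_threshold_pages / release_rate
    for pages in free_events:
        pending += pages
        if pending >= threshold:
            releases.append(pending)
            pending = 0
    return releases

events = [64] * 64  # 4096 pages freed in small increments
low = simulate(events, release_rate=1)    # few, large releases
high = simulate(events, release_rate=10)  # many, small releases
print(len(low), max(low))    # 4 1024
print(len(high), max(high))  # 32 128
```

Both runs return the same total amount of memory; only the granularity differs, which is the tradeoff the comments above describe (fewer stalls vs. more frequent, smaller madvise calls and their latency cost). Once SERVER-42697 lands, the rate can be tuned at runtime via setParameter instead of the environment variable.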

Comment by Bruce Lucas (Inactive) [ 24/Oct/18 ]

A related effect is that mongod's reluctance to release memory to the o/s in a timely manner can cause two additional problems:

  • It exacerbates the effect of memory fragmentation (SERVER-33296), because fragmentation causes an accumulation of unused memory that could otherwise be released to the o/s
  • If mongod allocates a large amount of memory, e.g. to process a query, do an index build, etc., that memory will not be returned to the o/s, and this can hamper recovery. See for example HELP-7990.

I mention these issues because I think a fix for this ticket is likely to help with both of those issues.

Comment by Ian Whalen (Inactive) [ 05/Aug/18 ]

ping mark.benvenuto

Comment by Bruce Lucas (Inactive) [ 21/Mar/18 ]

This has been seen by another customer on 3.6.1 in SERVER-34027.

Comment by Bruce Lucas (Inactive) [ 19/Oct/17 ]

I wonder if we should surface TCMALLOC_AGGRESSIVE_DECOMMIT as a server parameter, possibly changeable at runtime, so users don't have to set an environment variable and restart to test the impact of enabling it. mark.benvenuto, acm?

Generated at Thu Feb 08 04:26:59 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.