[SERVER-33296] Excessive memory usage due to heap fragmentation Created: 13/Feb/18  Updated: 25/Jan/24

Status: Backlog
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.6.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Ritesh Saigal
Resolution: Unresolved Votes: 8
Labels: malloc, memory-management, perf-effort-xlarge, perf-improve-product, perf-urgency-asap, perf-value-essential, tcmalloc
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 4.3.3-rr100.png     PNG File fragmentation-3.6.2.png     PNG File fragmentation.png     PNG File image-2019-10-21-15-09-33-867.png    
Issue Links:
Depends
Duplicate
is duplicated by SERVER-37541 MongoDB Not Returning Free Space to OS Closed
Related
related to SERVER-39325 Add support for "allocator=jemalloc" Backlog
related to SERVER-35046 Add parameter to set tcmalloc aggress... Closed
related to SERVER-31417 Improve tcmalloc when decommitting la... Backlog
Assigned Teams:
Product Performance
Operating System: Linux
Steps To Reproduce:

david.daly - Handing over to you per e-mail discussion.

Sprint: Dev Tools 2019-05-06, Dev Tools 2019-04-22
Participants:
Case:

 Description   

The changes described in SERVER-20306 eliminated a common source of memory fragmentation, but it can still occur for other reasons. Here's an example from a node undergoing initial sync:

Over time:

  • allocated memory never exceeds 8 GB
  • but heap size and resident memory reach nearly 14 GB
  • this is due to an accumulation of pageheap_free_bytes

A common cause of this is a shifting distribution of allocated buffer sizes: free pages that once served buffers of one size cannot be reused to satisfy new requests for buffers of a different size, so they accumulate in the page heap.
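To make that concrete, below is a minimal standalone sketch (not MongoDB code) that reads the counters mentioned above through gperftools' MallocExtension interface while the allocation size distribution shifts. It assumes the binary is linked against gperftools tcmalloc; the exact numbers printed depend on the tcmalloc version and the buffer sizes chosen, so this only illustrates how pageheap_free_bytes can be observed alongside allocated bytes and heap size.

// Sketch: observe pageheap_free_bytes as the allocation size distribution shifts.
// Assumes the binary is linked against gperftools tcmalloc.
#include <gperftools/malloc_extension.h>

#include <cstdio>
#include <vector>

namespace {

size_t getProperty(const char* name) {
    size_t value = 0;
    MallocExtension::instance()->GetNumericProperty(name, &value);
    return value;
}

void report(const char* label) {
    std::printf("%-22s allocated=%zu MB  heap=%zu MB  pageheap_free=%zu MB\n",
                label,
                getProperty("generic.current_allocated_bytes") >> 20,
                getProperty("generic.heap_size") >> 20,
                getProperty("tcmalloc.pageheap_free_bytes") >> 20);
}

}  // namespace

int main() {
    // Phase 1: allocate and free many buffers of one size. The freed pages are
    // retained in tcmalloc's page heap rather than returned to the OS.
    {
        std::vector<std::vector<char>> buffers;
        for (int i = 0; i < 4096; ++i)
            buffers.emplace_back(256 * 1024);  // 256 KB buffers
        report("after phase 1 alloc");
    }
    report("after phase 1 free");

    // Phase 2: switch to a different buffer size. If the leftover free pages
    // cannot be reused for the new size, heap size and pageheap_free_bytes grow
    // even though allocated bytes stay comparatively flat.
    {
        std::vector<std::vector<char>> buffers;
        for (int i = 0; i < 1024; ++i)
            buffers.emplace_back(1536 * 1024);  // 1.5 MB buffers
        report("after phase 2 alloc");
    }
    report("after phase 2 free");
    return 0;
}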

Setting TCMALLOC_AGGRESSIVE_DECOMMIT can address this issue by causing tcmalloc to aggressively return free pages to the OS, from where they can later be re-used by tcmalloc to satisfy new memory requests. However, this can have an unacceptable negative performance impact. Is there a tweak to tcmalloc that can give us better behavior for workloads like this?
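For reference, a hedged sketch of the two gperftools knobs relevant to that trade-off: enabling aggressive decommit programmatically (equivalent in intent to the TCMALLOC_AGGRESSIVE_DECOMMIT environment variable), and a one-shot ReleaseFreeMemory() call that could instead be issued periodically from a background thread. The property name "tcmalloc.aggressive_memory_decommit" is an assumption about the gperftools build in use; ReleaseFreeMemory() is part of the public MallocExtension API.

// Sketch only, not MongoDB's implementation.
#include <gperftools/malloc_extension.h>

void enableAggressiveDecommit() {
    // Assumed property name: freed spans are madvise()d back to the OS
    // immediately, trading extra syscalls and page faults for a smaller RSS.
    MallocExtension::instance()->SetNumericProperty(
        "tcmalloc.aggressive_memory_decommit", 1);
}

void releaseFreePagesOnce() {
    // One-shot alternative: leave aggressive decommit off, but periodically hand
    // the accumulated page-heap free pages back to the OS, e.g. whenever
    // pageheap_free_bytes crosses a threshold.
    MallocExtension::instance()->ReleaseFreeMemory();
}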



 Comments   
Comment by Eran Davidi [ 09/Dec/21 ]

Hi,

My name is Eran Davidi, and my company is a customer of MongoDB Atlas.

I wanted to ask whether there is an estimate of when this bug will be fixed, since I see it is very old.

We are using MongoDB 4.4.10 and are suffering from this issue.

Best Regards,

Eran Davidi

Comment by Andrew Shuvalov (Inactive) [ 26/Aug/21 ]

I think the ultimate solution should be to migrate to the new per-CPU TCMalloc. I explained why I think it will be faster in the comments on PERF-2106, "Performance support for TCMalloc Evaluation". Other malloc implementations should be considered as well, but the new TCMalloc is likely still the best option.

Comment by Ian Whalen (Inactive) [ 16/Mar/18 ]

Reviewed this in needs-triage, but the high cost-to-benefit ratio puts it outside the Storage team's top 15 at the moment. Putting it on the backlog; will re-appraise later.

CC asya

Comment by Bruce Lucas (Inactive) [ 14/Feb/18 ]

Verified that the same behavior is still observed in 3.6.2; see the attached fragmentation-3.6.2.png.
