[SERVER-24268] Investigate jemalloc as alternative to tcmalloc Created: 24/May/16  Updated: 01/Feb/19  Resolved: 19/Jul/16

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Alexander Gorrod Assignee: Alexander Gorrod
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongo_jemalloc.patch    
Issue Links:
Depends
is depended on by SERVER-24483 Consider utilising a pool based memor... Closed
Related
related to SERVER-39325 Add support for "allocator=jemalloc" Backlog
is related to SERVER-31839 Investigate JEMalloc Performance Vers... Closed
Participants:

 Description   

The server currently uses tcmalloc by default. There are some known issues with tcmalloc fragmentation and memory-release semantics (see SERVER-23333 and SERVER-20306). It is worth investigating whether jemalloc offers any benefits over tcmalloc.



 Comments   
Comment by Alexander Gorrod [ 19/Jul/16 ]

I’ve run a lot of tests comparing the performance and memory consumption characteristics of TCMalloc (current default) and JEMalloc. The following comment summarizes my findings.

Tl;dr: JEMalloc doesn't currently deliver enough advantages to warrant the risk of switching default allocators. If we identify specific use cases that could benefit from a pool-based memory allocator, it is worth reconsidering the change.

I ran a number of different tests to assess the behavior differences. I ran full sets of correctness tests via both WiredTiger Jenkins testing and MongoDB Evergreen testing - none of which showed any issues with JEMalloc. Following is a quick summary of the performance test suites:

  • Evergreen system performance tests, which run a selection of moderately sized benchmarks. The different allocators showed no significant differences.
  • WiredTiger automated performance tests, which run a variety of high throughput workloads. The different allocators show minor variations, but neither is clearly better overall.
  • The YCSB benchmark with the workload described in my 10/Jun comment below. The following table summarizes the differences:
Metric | TCMalloc | JEMalloc default | JEMalloc decay
Resident Size (GB) | 12.3 | 15.6 | 11.9
Virtual Size (GB) | 12.6 | 105.6 | 14
Load time (sec) | 503 | 461 | 440
Run time (sec) | 1879 | 2064 | 1836
Run 99th percentile Read Latency (ms) | 70 | 63 | 61
Run 99th percentile Update Latency (ms) | 89 | 83 | 79
Load Max Latency (ms) | 32,000 | 21,000 | 23,000
Run Max Read Latency (ms) | 20,000 | 30,000 | 33,000
Run Max Update Latency (ms) | 18,000 | 37,000 | 49,000
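For anyone re-running this comparison, one way to capture the Resident/Virtual Size figures above from a running mongod is the serverStatus command. This is only a sketch; the ticket doesn't record exactly how these numbers were gathered:

mongo --quiet --eval "printjson(db.serverStatus().mem)"   # mem.resident and mem.virtual are reported in MB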

Note how much difference the purge:decay setting makes for JEMalloc: without that setting, JEMalloc is not a viable alternative.

Overall, JEMalloc would be a reasonable substitute for TCMalloc in MongoDB, but the differences are not compelling enough to warrant a change at this stage. If we believe that a pool-based allocator could benefit us in the future, it's worth considering again.

One further note: the patch attached to this ticket adds JEMalloc into the MongoDB tree as an alternative allocator, enabled via "--allocator=jemalloc". It also exposes several basic JEMalloc allocator statistics and configuration options through MongoDB. In other words, if someone picks this work up again, I believe it's worth reviewing the patch on this ticket before jumping in.
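As a hedged illustration only (the stock SConstruct accepts --allocator=tcmalloc and --allocator=system; jemalloc is only a valid value once the attached mongo_jemalloc.patch is applied), a build might be invoked like this:

python buildscripts/scons.py --allocator=jemalloc mongod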

Comment by Alexander Gorrod [ 18/Jul/16 ]

A patch against MongoDB that adds JEMalloc as a compile-time allocator option.

Comment by Alexander Gorrod [ 18/Jul/16 ]

I found that there is a compile/run-time option to JEMalloc that makes it behave more sensibly for our needs: "purge:decay". I'll report soon on comparisons between JEMalloc and TCMalloc; all measurements involving JEMalloc are based on the 4.2.1 release built from source with "purge:decay" enabled.
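As an illustrative sketch rather than the exact configuration used here: decay-based purging can be enabled at run time via jemalloc's MALLOC_CONF environment variable (it can also be baked in when building jemalloc from source). The decay_time of 10 seconds below is jemalloc 4.x's documented default, not a tuned value:

MALLOC_CONF="purge:decay,decay_time:10" ./mongod --dbpath /data/db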

Comment by Alexander Gorrod [ 10/Jun/16 ]

I've been running a YCSB workload comparing builds of MongoDB that use jemalloc to the built-in tcmalloc. The YCSB configuration I've been running is:

recordcount=10000000
operationcount=10000000
threadcount=20
workload=com.yahoo.ycsb.workloads.CoreWorkload
 
readallfields=true
 
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
 
requestdistribution=uniform

This configuration is designed to stress the memory allocator.
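For reproducibility, here is a sketch of driving this workload against a local mongod; the properties file name and connection URL are illustrative, not taken from my actual runs:

./bin/ycsb load mongodb -s -P memstress.properties -p mongodb.url=mongodb://localhost:27017/ycsb
./bin/ycsb run mongodb -s -P memstress.properties -p mongodb.url=mongodb://localhost:27017/ycsb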

The jemalloc options I've tried configuring are:

Option name | Default | Configured Value | Result | Description
opt.narenas | 32 (4xCPU) | 8 | No apparent change | Reduces the number of arenas shared amongst allocations
opt.tcache | true | false | Makes throughput much slower | Disables the thread cache
opt.dss | false | primary | No apparent change | Uses sbrk rather than mmap
opt.lg_tcache_max | 32k | 4k | No apparent change | Max object size stored in the thread cache
opt.lg_tcache_max | 32k | 256k | No apparent change | Max object size stored in the thread cache
opt.lg_chunk | — | 1MB | Max cache usage reduced from about 16GB to 13GB | Chunk size
opt.lg_chunk | — | 512K | Error with ENOMEM | Chunk size

I have the actual results and will quantify them next week, but wanted to save an overview here, mostly for myself, in the meantime.
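For anyone repeating these experiments: the options above can be applied without rebuilding via the MALLOC_CONF environment variable, and the lg_* options take base-2 logarithms, so a 1MB chunk is lg_chunk:20 and a 4k tcache ceiling is lg_tcache_max:12. A hedged example combining a few of the settings from the table, not necessarily a combination I ran as a single configuration:

MALLOC_CONF="narenas:8,lg_tcache_max:12,lg_chunk:20" ./mongod --dbpath /data/db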

Comment by Eric Milkie [ 24/May/16 ]

SERVER-16773 has some data on prior investigations with jemalloc.
