[SERVER-16773] Performance degradation due to TCMalloc scalability Created: 08/Jan/15  Updated: 03/Dec/21  Resolved: 26/Jan/15

Status: Closed
Project: Core Server
Component/s: Performance, Storage
Affects Version/s: 2.8.0-rc4
Fix Version/s: 3.0.0-rc7

Type: Bug Priority: Major - P3
Reporter: John Page Assignee: Eliot Horowitz (Inactive)
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

MongoDB using the TCMalloc memory allocator.


Attachments: PNG File MMAP-26.csv.ops.png     PNG File MMAP-28.csv.ops.png     PNG File WorkloadSuspension.png     PNG File rc4.png    
Issue Links:
Duplicate
is duplicated by SERVER-16879 Degraded performance on rc5 (mms-prod... Closed
Related
related to SERVER-16131 Log File Blowing up on sharded ycsb r... Closed
related to SERVER-20104 WT high memory usage due to high amou... Closed
related to SERVER-16879 Degraded performance on rc5 (mms-prod... Closed
is related to SERVER-22763 Investigate performance of new gperft... Closed
is related to SERVER-31839 Investigate JEMalloc Performance Vers... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

Performance degradation due to TCMalloc scalability with very large number of threads. After a few minutes of running with hundreds of threads, the large majority of CPU time is spent scavenging memory. As this often occurs in critical sections, system throughput may degrade by an order of magnitude or more. Increasing the TCMalloc thread cache to its maximum of 1 GB does not avoid this problem. Using the system allocator does, but costs performance in most other cases.
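The thread-cache ceiling mentioned above can be raised at startup. A minimal sketch, assuming the standard gperftools environment variable and the MongoDB `setParameter` name used in later server builds (verify both against your build; this is illustrative config, not the fix adopted in this ticket):

```shell
# Raise TCMalloc's total thread cache to 1 GB via the gperftools
# environment variable, read by the allocator at process start:
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=$((1024 * 1024 * 1024)) \
    mongod --config /etc/mongod.conf

# Later server versions expose the same knob as a setParameter
# (name is an assumption for 2.8/3.0-era builds):
mongod --setParameter tcmallocMaxTotalThreadCacheBytes=1073741824
```

As the description notes, even the 1 GB maximum did not avoid the scavenging pathology here.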



 Comments   
Comment by deyukong [ 23/Dec/17 ]

Agree with @Igor Canadi — we hit the same situation.
After about a day of running with roughly 4K threads, tcmalloc somehow ranks near the top in perf, and the affected mongod uses much more CPU than the healthy ones; functions related to CentralFreeList rank highest.
After switching to jemalloc, mongod uses more physical memory, but it no longer stalls.

Btw, changing tcmalloc's max_thread_cache offers no help.
serverStatus() shows that plenty of memory is still free when mongod stalls.
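The two observations in this comment (perf ranking and serverStatus) can be reproduced with a short diagnostic sequence. A sketch, assuming a typical Linux host with a single mongod and the gperftools-backed `tcmalloc` section in serverStatus (these commands need live processes, so they are illustrative only):

```shell
# Sample on-CPU stacks of the running mongod; CentralFreeList-related
# frames dominating the profile is the signature of this issue:
perf top -p "$(pidof mongod)"

# Dump the allocator counters from serverStatus(); compare the free/cached
# byte counts on a stalled node against a healthy one:
mongo --eval 'printjson(db.serverStatus().tcmalloc)'
```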

Comment by Igor Canadi [ 06/Oct/15 ]

The same issue was observed by Ceph: https://ceph.com/planet/the-ceph-and-tcmalloc-performance-story/

Switching to jemalloc helped them, too.

Comment by Igor Canadi [ 06/Oct/15 ]

We encountered a similar issue with MongoRocks. 35% of the CPU was being spent in tcmalloc, usually in functions related to CentralFreeList (meaning that the thread-local cache couldn't fulfill the request). There were a lot of context switches, which indicates lock contention, likely in CentralFreeList. We were running version 3.0.6 with Eliot's latest commit, meaning that the cache size was configured to be 1 GB.

After switching to jemalloc, the CPU time spent on malloc/free went down to ~4% and latency improved dramatically.
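For anyone wanting to try the jemalloc swap described in these comments: tcmalloc is linked into mongod statically, so simply preloading jemalloc is not enough. A sketch, assuming a from-source build and a distro-provided jemalloc (paths and flags are assumptions, not a supported configuration):

```shell
# Build mongod against the system allocator instead of the bundled tcmalloc:
scons --allocator=system mongod

# Then route the system allocator calls to jemalloc at runtime:
LD_PRELOAD=/usr/lib/libjemalloc.so ./mongod --config /etc/mongod.conf
```

Confirm the result with `db.serverBuildInfo().allocator`, which should report "system" rather than "tcmalloc".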

Comment by Githook User [ 28/Jan/15 ]

Author:

{u'username': u'erh', u'name': u'Eliot Horowitz', u'email': u'eliot@10gen.com'}

Message: SERVER-16773: Increase TCMalloc default cache size to 1G

(cherry picked from commit 9da40029fef37df8d33218101ffa2ff22d94a2da)
Branch: v3.0
https://github.com/mongodb/mongo/commit/5d71769398636406ef11f850ef6df163a47bb902

Comment by Githook User [ 28/Jan/15 ]

Author:

{u'username': u'erh', u'name': u'Eliot Horowitz', u'email': u'eliot@10gen.com'}

Message: SERVER-16773: Increase TCMalloc default cache size to 1G
Branch: master
https://github.com/mongodb/mongo/commit/9da40029fef37df8d33218101ffa2ff22d94a2da

Comment by Githook User [ 26/Jan/15 ]

Author:

{u'username': u'GeertBosch', u'name': u'Geert Bosch', u'email': u'geert@mongodb.com'}

Message: SERVER-16773: Increase TCMalloc default cache size to 256MB

(cherry picked from commit c30688f704e3fbde4ee83aa2f45a6d79900f10c9)
Branch: v3.0
https://github.com/mongodb/mongo/commit/a0ad9d0380bbe874460151cfa3901b52af28e7c6

Comment by Githook User [ 26/Jan/15 ]

Author:

{u'username': u'GeertBosch', u'name': u'Geert Bosch', u'email': u'geert@mongodb.com'}

Message: SERVER-16773: Increase TCMalloc default cache size to 256MB
Branch: master
https://github.com/mongodb/mongo/commit/c30688f704e3fbde4ee83aa2f45a6d79900f10c9

Comment by Githook User [ 26/Jan/15 ]

Author:

{u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}

Message: SERVER-16773 Get rid of ScopedLock and 2 memory allocations

This change removes ScopedLock from the RAII lock objects' hierarchy. This
eliminates two memory allocations and two acquisitions of the Parallel
Batch Writer mutex.

I did not see any significant performance improvement, but next change
would be to remove the allocation of the PBR mutex as well and also to
make WriteBatchExecutor not allocate lock objects.

(cherry picked from commit fe3e92d4257b30f01b62d4ef941686b7e0138a8c)
Branch: v3.0
https://github.com/mongodb/mongo/commit/0649555aa3fae26e6770a3186e2766c26fd0cfcf

Comment by Githook User [ 26/Jan/15 ]

Author:

{u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}

Message: SERVER-16773 Get rid of ScopedLock and 2 memory allocations

This change removes ScopedLock from the RAII lock objects' hierarchy. This
eliminates two memory allocations and two acquisitions of the Parallel
Batch Writer mutex.

I did not see any significant performance improvement, but next change
would be to remove the allocation of the PBR mutex as well and also to
make WriteBatchExecutor not allocate lock objects.
Branch: master
https://github.com/mongodb/mongo/commit/fe3e92d4257b30f01b62d4ef941686b7e0138a8c

Comment by Githook User [ 26/Jan/15 ]

Author:

{u'username': u'erh', u'name': u'Eliot Horowitz', u'email': u'eliot@10gen.com'}

Message: SERVER-16773: better tcmalloc cleanup

(cherry picked from commit c4ec7db6ce25e7147ae5f46ac0a7f6a52a0b4c3e)
Branch: v3.0
https://github.com/mongodb/mongo/commit/3ef65d17fd8081cb65330bf90bffd093e7fd020b

Comment by Githook User [ 26/Jan/15 ]

Author:

{u'username': u'erh', u'name': u'Eliot Horowitz', u'email': u'eliot@10gen.com'}

Message: SERVER-16773: better tcmalloc cleanup
Branch: master
https://github.com/mongodb/mongo/commit/c4ec7db6ce25e7147ae5f46ac0a7f6a52a0b4c3e

Comment by Githook User [ 22/Jan/15 ]

Author:

{u'username': u'GeertBosch', u'name': u'Geert Bosch', u'email': u'geert@mongodb.com'}

Message: SERVER-16773: Mark threads idle for tcmalloc
Branch: master
https://github.com/mongodb/mongo/commit/4c83a604004c329ac114c53df38ac96421ebcf83

Comment by John Page [ 19/Jan/15 ]

Do rc4 and rc5 show a difference?

Comment by John Page [ 19/Jan/15 ]

If you want me to retest I'm happy to do so. 2.8-rc5?

Comment by Geert Bosch [ 19/Jan/15 ]

I just redid a run with a build of vanilla rc5 without any patches, and during a 1500-second run I didn't see any performance anomalies; I will attach the results graph.

server build info:
> db.serverBuildInfo()
{
"version" : "2.8.0-rc5",
"gitVersion" : "74b351de21c84438b12a83b28e155f5e69e3c1eb",
"OpenSSLVersion" : "",
"sysInfo" : "Linux gouda 3.13.0-39-generic #66-Ubuntu SMP Tue Oct 28 13:30:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49",
"loaderFlags" : "-fPIC -pthread -Wl,-z,now -rdynamic",
"compilerFlags" : "-Wnon-virtual-dtor -Woverloaded-virtual -std=c++11 -fPIC -fno-strict-aliasing -ggdb -pthread -Wall -Wsign-compare -Wno-unknown-pragmas -Winvalid-pch -pipe -Werror -O3 -Wno-unused-local-typedefs -Wno-unused-function -Wno-deprecated-declarations -Wno-unused-but-set-variable -Wno-missing-braces -fno-builtin-memcmp -std=c99",
"allocator" : "tcmalloc",
"versionArray" : [
2,
8,
0,
-5
],
"javascriptEngine" : "V8",
"bits" : 64,
"debug" : false,
"maxBsonObjectSize" : 16777216,
"ok" : 1
}
>

Comment by John Page [ 19/Jan/15 ]

I don't have any answers for that one.

Comment by Geert Bosch [ 19/Jan/15 ]

OK, thanks. I'll resolve that; however, this does not seem to invalidate the results so far. Do you have any idea why the first 450 seconds of a 600-second run would look different from the first 450 seconds of a 1500-second run?

Comment by John Page [ 17/Jan/15 ]

What you are seeing is what happens when you don't have enough client-side file handles (ulimit). Processes in the loader die after n seconds unless there is an error from the server.

Comment by John Page [ 17/Jan/15 ]

That's the C driver code for not enough available file handles.
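If the loader is hitting the descriptor ceiling as described here, a quick check on the load-generator host looks like the following. A sketch (the target value is an arbitrary example; each client thread holds at least one socket, so 1000 threads need comfortably more than 1000 descriptors):

```shell
# Show the current per-process open-file limit for this shell:
ulimit -n

# Raise the soft limit before starting the loader; this only succeeds
# if the hard limit allows it, so ignore a failure here:
ulimit -n 4096 2>/dev/null || true

# Confirm the limit actually in effect:
ulimit -n
```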

Comment by Geert Bosch [ 17/Jan/15 ]

There seems to be a problem with the load generator: the first 500 seconds look much better if I specify a test duration of 1500 seconds. There is also an issue with not all threads always completing: if I specify 1000 threads, the test might finish with 250 or so. The error message is: "Failed to read 4 bytes from socket. Child Quitl." The test system has plenty of resources to run and is never overloaded.

Anyway, I get good throughput now with regular journaling/yielding etc. enabled for a 1500 second (25 min) test run.

Comment by David Daly [ 16/Jan/15 ]

I tried John's workload against RC4. Interesting behavior.

After about 3-4 minutes, the performance increases. I've labelled that point as time B; the test starts at A. Performance seems stable after B.

Comment by David Daly [ 16/Jan/15 ]

Short summary:

  • Query only workload does not show degradation
  • Update only workload shows a slow degradation over time
Comment by John Page [ 16/Jan/15 ]

It will depend on what version of that code you have.
Older code opened another X connections to re-read the test to see if it had changed, then dropped them again; newer versions use the same connection throughout, so with 1000 threads it's a constant 1001 connections, not 1001 -> 2001 -> 1001.

Comment by David Daly [ 16/Jan/15 ]

I think the spike in inserts is an artifact of the restart and the workload reconnection. I think the workload fails to get the test specification from the database, and then runs the default workload for a little while, until it succeeds in querying the server for the current test configuration.

Comment by David Daly [ 16/Jan/15 ]

Looking at the stats, I see the connections go to zero when I restart it, but then it goes back to where it was after I resume the workload.

Comment by David Daly [ 16/Jan/15 ]

In case it's interesting, here's a shot of the spike in inserts after restart

  • A: Start workload with 512 threads
  • B: Turn off inserts
  • C: Suspend workload
  • D: Resume workload
  • E: Suspend workload and restart server
  • F: Resume workload

Also, opcounters for query and update seem higher than documents returned and documents updated.

Comment by John Page [ 16/Jan/15 ]

I don't think the workload will reconnect if you restart the server, though; it builds its connection set up front.

Comment by David Daly [ 16/Jan/15 ]

Thanks john.page.

Trying a slightly different tack now: doing a long-duration run, and then suspending the workload process to restart the server.
One interesting thing (maybe an artifact) is that I see a burst of inserts after unsuspending the workload. Going to try various loads and see what I can find.

Comment by John Page [ 16/Jan/15 ]

Queries and updates only access records inserted in that run, so a query/update-only workload isn't doing much.

Comment by David Daly [ 16/Jan/15 ]

The workload can be adjusted between inserts, queries, and updates by adjusting the entry in the testsrv.test collection on the system under test. There is one record with _id : "loadtest". The number of operations of each type is proportional to the value in each field. Set "insert" : 0 to stop all inserts. Fields can be updated while the test is running.
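The control-document mechanism above can be exercised from the mongo shell. A sketch using the names given in the comment (`testsrv` database, `test` collection, `_id: "loadtest"`), for example to stop all inserts mid-run (illustrative only; requires the live system under test):

```shell
# Zero out the "insert" field of the control document while the test is
# running; the loader re-reads this document and stops issuing inserts:
mongo testsrv --eval '
  db.test.update({ _id: "loadtest" }, { $set: { insert: 0 } })'
```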

Some initial observations from adjusting the running load (all with 512 threads)

  • Ran the default workload and saw the performance drop off.
  • Set inserts to zero: saw the inserts stop, but the performance of the queries and updates did not appreciably improve.
  • After idling the system, the mixed query/update workload without inserts showed significantly higher performance.
  • The query/update workload performance does not drop off after the idle period (10-minute run).
  • Restarted the server and ran the query/update workload. It showed stable performance.
Comment by Geert Bosch [ 15/Jan/15 ]

Interestingly, jemalloc doesn't get as bad, but shows the same shape. The LockLessInc malloc is similar as well.

Comment by Geert Bosch [ 15/Jan/15 ]

TCMalloc 2.0 has similar behavior as well.

Comment by Geert Bosch [ 15/Jan/15 ]

TCMalloc 2.4 has the same behavior as the 2.2 version we are using now. Will try 2.0 next.

Comment by Daniel Pasette (Inactive) [ 10/Jan/15 ]

John, can you include the git hash of the mongod binaries you're testing as well as the parameters you're passing to mongod?

Generated at Thu Feb 08 03:42:13 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.