[SERVER-30114] Monitor cumulative time spent in tcmalloc spin lock in serverStatus Created: 12/Jul/17  Updated: 30/Oct/23  Resolved: 09/Feb/18

Status: Closed
Project: Core Server
Component/s: Diagnostics
Affects Version/s: None
Fix Version/s: 3.2.20, 3.4.14, 3.6.4, 3.7.2

Type: Improvement Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Mark Benvenuto
Resolution: Fixed Votes: 25
Labels: SWDI
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Java Source File load.java     PNG File run3.png     File spinlock.diff     PNG File spinlock.png     File spinlock2.diff    
Issue Links:
Backports
Depends
Related
Backwards Compatibility: Fully Compatible
Backport Requested:
v3.6, v3.4, v3.2
Sprint: Platforms 2018-01-29, Platforms 2018-02-12
Participants:
Case:

 Description   

We sometimes encounter allocator bottlenecks that currently require asking users to collect stack traces with gdb to diagnose. We could diagnose these bottlenecks from FTDC data if we counted cumulative time spent waiting for the primary lock involved. Typical stacks for such a bottleneck look like this:

#0  0x00000000014a6f85 in base::internal::SpinLockDelay(int volatile*, int, int) ()
#1  0x00000000014a6e57 in SpinLock::SlowLock() ()
#2  0x00000000014a9393 in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) ()
#3  0x00000000014b4d2a in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) ()

#0  0x00000000014a6f85 in base::internal::SpinLockDelay(int volatile*, int, int) ()
#1  0x00000000014a6e57 in SpinLock::SlowLock() ()
#2  0x00000000014a8efd in tcmalloc::CentralFreeList::InsertRange(void*, void*, int) ()
#3  0x00000000014b4ed8 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) ()
#4  0x00000000014b4f7d in tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned long) ()



 Comments   
Comment by Githook User [ 26/Feb/18 ]

Author:

{'email': 'mark.benvenuto@mongodb.com', 'name': 'Mark Benvenuto', 'username': 'markbenvenuto'}

Message: SERVER-30114 Add simple slow spinlock monitoring

(cherry picked from commit ba957823288d674f34c8345059f35f34cb4dd055)
Branch: v3.6
https://github.com/mongodb/mongo/commit/35df11d74e04899a01069951fe3bad9a8295859c

Comment by Githook User [ 20/Feb/18 ]

Author:

{'email': 'mark.benvenuto@mongodb.com', 'name': 'Mark Benvenuto', 'username': 'markbenvenuto'}

Message: SERVER-30114 Add simple slow spinlock monitoring

(cherry picked from commit ba957823288d674f34c8345059f35f34cb4dd055)
Branch: v3.2
https://github.com/mongodb/mongo/commit/bcf597ce00b63a04703e6a96414b5fab85e86d34

Comment by Githook User [ 20/Feb/18 ]

Author:

{'email': 'mark.benvenuto@mongodb.com', 'name': 'Mark Benvenuto', 'username': 'markbenvenuto'}

Message: SERVER-30114 Add simple slow spinlock monitoring

(cherry picked from commit ba957823288d674f34c8345059f35f34cb4dd055)
Branch: v3.4
https://github.com/mongodb/mongo/commit/f8969ad5eda7a5e3b753d01fa1861b645425c7e6

Comment by Githook User [ 09/Feb/18 ]

Author:

{'email': 'mark.benvenuto@mongodb.com', 'name': 'Mark Benvenuto', 'username': 'markbenvenuto'}

Message: SERVER-30114 Add simple slow spinlock monitoring
Branch: master
https://github.com/mongodb/mongo/commit/ba957823288d674f34c8345059f35f34cb4dd055

Comment by Bruce Lucas (Inactive) [ 24/Jan/18 ]

Once we introduce the metric we can see whether those workloads generate significant tcmalloc spinlock contention; if they do, we know they are useful tests for gauging any impact of introducing the metric.

Comment by Andrew Morrow (Inactive) [ 24/Jan/18 ]

I meant it the other way around - I want to use those workloads to ensure that introducing the metric has acceptable (hopefully negligible) performance overhead before we propose it upstream and/or merge it ourselves.

Comment by Bruce Lucas (Inactive) [ 24/Jan/18 ]

Once we've introduced the metric we can check those workloads to see whether they're generating a signal.

Comment by David Daly [ 24/Jan/18 ]

We've seen interesting tcmalloc-related behavior with the mixed_* workloads in core_workloads in the past. They use a large number of threads, with groups of threads performing different simple operations (find, update, delete, ...).

Comment by Bruce Lucas (Inactive) [ 23/Jan/18 ]

Simple performance test. Java load generator (attached) run as follows:

java -cp ... load host=localhost op=spinlock threads=200 tps=1e100 qps=1e100 dur=30 size=0 batch=1000

This is 200 threads all inserting tiny (_id-only) documents in batches of 1000. Standalone, no journal, no checkpoints, running on 12 cores. Alternating 30-second runs with and without spinlock monitor, restarting mongod each run:

As the spinlock monitor stats show, we're hitting the spinlock moderately hard because we have 200 threads on 12 cores filling the cache with tiny documents; on average there are 5-10 threads waiting for the spinlock at any given time.

No visible difference. Numerically, total number of inserts in each 30 second run:

with spinlock monitor: 21157000 20973000 21047000 21267000 21373000 20563000 21208000 20299000 20714000 21007000

without spinlock monitor: 20899000 21677000 20777000 21190000 20996000 21061000 21197000 20857000 21705000 21330000

With the spinlock monitor the average is about 1% lower than without, which is about 0.6 standard deviations, so it's in the noise.
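
For reference, a quick way to reproduce those summary statistics from the raw per-run totals above (standard library only; not part of the test harness):

#include <cmath>
#include <cstdio>
#include <vector>

static double mean(const std::vector<double>& v) {
    double s = 0;
    for (double x : v) s += x;
    return s / v.size();
}

static double stddev(const std::vector<double>& v) {
    double m = mean(v), s = 0;
    for (double x : v) s += (x - m) * (x - m);
    return std::sqrt(s / (v.size() - 1));   // sample standard deviation
}

int main() {
    // Per-run insert totals from the ten 30-second runs listed above.
    std::vector<double> withMonitor = {21157000, 20973000, 21047000, 21267000, 21373000,
                                       20563000, 21208000, 20299000, 20714000, 21007000};
    std::vector<double> withoutMonitor = {20899000, 21677000, 20777000, 21190000, 20996000,
                                          21061000, 21197000, 20857000, 21705000, 21330000};
    // Prints means of ~20.96M vs ~21.17M (about 1% apart) and standard deviations
    // of ~0.33M, so the gap is roughly 0.6 standard deviations.
    std::printf("with monitor:    mean=%.0f sd=%.0f\n", mean(withMonitor), stddev(withMonitor));
    std::printf("without monitor: mean=%.0f sd=%.0f\n", mean(withoutMonitor), stddev(withoutMonitor));
}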

Comment by Bruce Lucas (Inactive) [ 22/Dec/17 ]

I've attached a new POC (spinlock2.diff) that measures the actual time spent in SpinLockDelay rather than just accumulating the supplied sleep times, because the Linux implementation uses a futex with wakeups and so may return before the supplied sleep time elapses. This requires reading the current time, but since that is only done when we're already taking an expensive delay, I think it's unlikely to have a performance impact.
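
A rough sketch of that approach (illustrative names and a wrapper for readability; the real change lives in the attached spinlock2.diff):

#include <atomic>
#include <chrono>
#include <cstdint>

// Forward declaration of tcmalloc's internal delay routine (see the stack
// traces in the description); assumes the internal symbol is reachable.
namespace base { namespace internal {
void SpinLockDelay(int volatile* w, int value, int loop);
} }

std::atomic<uint64_t> gSpinLockTotalDelayNanos{0};

// Wrap the slow-path delay with a monotonic clock so we record the time
// actually spent waiting, not the sleep that was requested. The clock reads
// happen only when we are already about to block, so their cost is dwarfed
// by the delay itself.
void timedSpinLockDelay(int volatile* w, int value, int loop) {
    auto start = std::chrono::steady_clock::now();
    base::internal::SpinLockDelay(w, value, loop);   // futex wait; may wake early
    auto waited = std::chrono::steady_clock::now() - start;
    gSpinLockTotalDelayNanos.fetch_add(
        std::chrono::duration_cast<std::chrono::nanoseconds>(waited).count(),
        std::memory_order_relaxed);
}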

Comment by Bruce Lucas (Inactive) [ 21/Dec/17 ]

I think that we can get 90% of the diagnostic value with something much simpler:

  • monitor only time spent in SpinLockDelay, as that is where I've seen most of the time spent when there is tcmalloc contention.
  • keep only a single counter for all tcmalloc spinlocks; this will allow us to determine when tcmalloc is the bottleneck.

I've attached a POC implementation; it's only a few lines of code.
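
For illustration, a minimal sketch of that kind of counter (hypothetical names; the attached spinlock.diff is the actual POC): accumulate the sleep duration SpinLockDelay is about to take into a single global atomic and expose it to the monitoring layer.

#include <atomic>
#include <cstdint>

// One counter shared by all tcmalloc spin locks; relaxed atomics keep the
// hot path cheap, and the value is only read when serverStatus/FTDC samples it.
std::atomic<uint64_t> gSpinLockDelayNanos{0};

// Inside tcmalloc's SpinLockDelay slow path (illustrative):
//   gSpinLockDelayNanos.fetch_add(requested_sleep_ns, std::memory_order_relaxed);
//   // ... existing nanosleep/futex wait ...

// What the reporting side would read out (hypothetical accessor name).
uint64_t getSpinLockDelayNanos() {
    return gSpinLockDelayNanos.load(std::memory_order_relaxed);
}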

Here's an example in use:

Left is the current default executor; right is the adaptive executor. We see some allocator contention in the first run, although it maxes out at about 37 threads out of 145 so it's probably not impacting performance much.

Would it be feasible to put this monitoring in?

Comment by Mark Benvenuto [ 18/Dec/17 ]

There are a few reasons why I do not want to add stats tracking on a permanent basis to MongoDB:

  1. It is expensive to maintain the counters at runtime. In my simple, naive prototype, I think it was a 15-20% hit on performance to run with the counters. In any final solution, maintaining counters for each spinlock adds a lot of cache-line contention, or a more complicated solution needs to be implemented that avoids heavy cache-line contention (a sketch of one such layout appears after this list).
  2. It will be expensive to maintain going forward. The prototype I linked to was ~400 changed lines, and did not even support debug tcmalloc. In order to track spin lock statistics accurately, we need to ensure every call site is updated. The prototype is one way to do that, and since it affects every call site, it is an expensive change to maintain across TCMalloc versions.
  3. There are a lot of counters to report information on. We would likely only ask customers to collect this information in FTDC after a problem has occurred. In retail tcmalloc, there are approximately 99 counters to log to FTDC, including the 88 individual counters for the CentralFreeList.
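
As an illustration of the kind of layout point 1 alludes to (my sketch, not the linked prototype), per-lock or per-core counters would need to be padded and sharded so concurrent waiters don't bounce a single cache line:

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>

// Illustrative sharded counter: each shard sits on its own cache line so
// waiters on different cores do not contend for the same line.
struct alignas(64) PaddedCounter {
    std::atomic<uint64_t> value{0};
};

constexpr std::size_t kShards = 64;            // e.g. >= number of cores
PaddedCounter gSpinDelayShards[kShards];

void addSpinDelay(uint64_t ns) {
    // Cheap shard selection; a real implementation might use a per-CPU index.
    std::size_t shard =
        std::hash<std::thread::id>{}(std::this_thread::get_id()) % kShards;
    gSpinDelayShards[shard].value.fetch_add(ns, std::memory_order_relaxed);
}

uint64_t totalSpinDelay() {                    // called only when reporting
    uint64_t total = 0;
    for (const auto& s : gSpinDelayShards)
        total += s.value.load(std::memory_order_relaxed);
    return total;
}
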
Comment by Bruce Lucas (Inactive) [ 20/Nov/17 ]

mark.benvenuto, acm, my apologies, but I think I neglected to bring this ticket up when we were reviewing tcmalloc work for 3.6. My concern is that we made a fairly major change in the way we use tcmalloc (disabling markThreadIdle), with an option to re-enable it if a particular customer workload proves to need it, but we don't have a good way to monitor whether tcmalloc has become a bottleneck. In my experience with customer issues, tcmalloc typically bottlenecks here while leaving no trace in the metrics, making the problem difficult to diagnose. This metric would fix that.

I realize that it is probably too tight to get this into 3.6.0, particularly since we need to make sure this monitoring doesn't introduce performance problems itself, but would it be possible to do this early in 3.7 and backport it to 3.6.small?

Comment by Mark Benvenuto [ 12/Jul/17 ]

Prototype code that tracks tcmalloc SpinLock spinning by spinlock type:

https://github.com/markbenvenuto/mongo/commit/eda17993ec4638d38dc2c8d0e105082219c124e5
