[SERVER-20409] Negative scaling with more than 10K connections Created: 15/Sep/15 Updated: 07/Dec/16 Resolved: 26/Jan/16
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance, WiredTiger |
| Affects Version/s: | 3.0.7, 3.2.0-rc1 |
| Fix Version/s: | 3.2.3, 3.3.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Geert Bosch |
| Resolution: | Done | Votes: | 0 |
| Labels: | code-only |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Completed: | |
| Sprint: | Integration F (02/01/16) |
| Participants: | |
| Description |
This is a follow-on to
Storage engine was WT so this could be
| Comments |
| Comment by Githook User [ 01/Feb/16 ] |

Author: Geert Bosch (GeertBosch, geert@mongodb.com)
Message: (cherry picked from commit d79994be3744cb8ae6fa03d775feae178ac69dbc)
| Comment by Githook User [ 26/Jan/16 ] |

Author: Geert Bosch (GeertBosch, geert@mongodb.com)
Message:
| Comment by Geert Bosch [ 25/Jan/16 ] |

I made the trivial change to adjust the order of getting tickets and sessions:

This does indeed seem to help (red is after the patch, blue before):
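For illustration, a minimal sketch of the reordering described above, assuming the change is simply to take the concurrency ticket before checking out a WiredTiger session; the class names below are simplified stand-ins, not the actual mongod code:

```cpp
// Sketch only: acquire the admission ticket *before* checking out a session,
// so a thread parked waiting for a ticket does not also pin a cached session
// (and its cursors) while it waits. TicketHolder/SessionCache are stand-ins.
#include <condition_variable>
#include <memory>
#include <mutex>

class TicketHolder {
public:
    explicit TicketHolder(int n) : _available(n) {}
    void acquire() {                       // may block until a ticket is free
        std::unique_lock<std::mutex> lk(_m);
        _cv.wait(lk, [&] { return _available > 0; });
        --_available;
    }
    void release() {
        { std::lock_guard<std::mutex> lk(_m); ++_available; }
        _cv.notify_one();
    }
private:
    std::mutex _m;
    std::condition_variable _cv;
    int _available;
};

struct Session {};                          // placeholder for a WiredTiger session
struct SessionCache {
    std::unique_ptr<Session> checkOut() { return std::make_unique<Session>(); }
};

void runOperation(TicketHolder& tickets, SessionCache& sessions) {
    tickets.acquire();                      // 1. wait for admission first
    auto session = sessions.checkOut();     // 2. only then take a session
    // ... perform the insert/query with `session` ...
    tickets.release();                      // release in reverse order
}

int main() {
    TicketHolder tickets(128);              // e.g. 128 concurrent tickets
    SessionCache sessions;
    runOperation(tickets, sessions);
    return 0;
}
```

The point of the ordering is that threads blocked on admission hold no storage-engine resources while they wait.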
| Comment by Bruce Lucas (Inactive) [ 14/Nov/15 ] |

That could be explained if the futex had a lot of waiters and it were trying to wake them all. However, in the context where it's called (sem_post), if I'm interpreting it correctly, I would have thought it would only wake one. On the other hand, I think I've seen similar cases where an unlocking type of operation (I don't recall the details) seemed unexpectedly slow when there were a lot of waiters.
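For what it's worth, a small standalone experiment along these lines (not from the ticket) is consistent with the one-waiter interpretation: with many threads blocked in sem_wait(), a single sem_post() should release exactly one of them, since glibc implements it as a futex wake of a single waiter rather than a broadcast.

```cpp
// Standalone experiment: 1000 threads block in sem_wait(); one sem_post()
// should wake exactly one of them. Compile with -pthread on Linux.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <semaphore.h>
#include <thread>
#include <vector>

int main() {
    sem_t sem;
    sem_init(&sem, /*pshared=*/0, /*value=*/0);

    std::atomic<int> woken{0};
    std::vector<std::thread> waiters;
    for (int i = 0; i < 1000; ++i) {
        waiters.emplace_back([&] {
            sem_wait(&sem);          // all 1000 threads block here
            ++woken;
        });
    }

    std::this_thread::sleep_for(std::chrono::seconds(1));
    sem_post(&sem);                  // wakes a single waiter...
    std::this_thread::sleep_for(std::chrono::seconds(1));
    std::printf("woken after one post: %d\n", woken.load());  // expect 1

    for (int i = 0; i < 999; ++i) sem_post(&sem);  // drain so threads can exit
    for (auto& t : waiters) t.join();
    sem_destroy(&sem);
    return 0;
}
```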
| Comment by Martin Bligh [ 13/Nov/15 ] |

Another intriguing perf trace (attempting 15K threads):
| Comment by David Hows [ 12/Nov/15 ] |

Ran with the standard glibc allocator, tcmalloc and jemalloc today. They all seemed to "elbow" at around the 8500-connection mark using the attached 10000men.js script. That rules out my hypothesis from yesterday.
| Comment by Martin Bligh [ 11/Nov/15 ] |

Found a way to get perf to work with 10K threads:
| Comment by Martin Bligh [ 11/Nov/15 ] |

Attaching a slightly tweaked version of Bruce's repro (10000men.js) that runs a bit faster and also prints stats after reaching its maximum value. On my box, taking it to 14K and letting it sit there seems to make it teeter on the brink of death.
| Comment by Martin Bligh [ 11/Nov/15 ] |

Lots of threads seem to be waiting for a ticket (I stopped the whole thing with gdb at this point).
| Comment by Martin Bligh [ 11/Nov/15 ] |

Seems unstable. Sometimes it goes off before 10K.

Spins like a politician:
| Comment by Bruce Lucas (Inactive) [ 11/Nov/15 ] |

Regarding the question about mmap: because of the collection-level locking in mmap, the test as written will bottleneck at a very small number of threads on collection access. Tried a couple of things:
| Comment by Martin Bligh [ 11/Nov/15 ] |

Perf callgraph doesn't work with 10,000 threads, but perf top does. With --setParameter enableReplSnapshotThread=false:

Without that, it gets to about 14K (might just be noise) and then goes nuts:

The only change I made to Bruce's repro script was to set it to 15K, then I changed this from 100 to 50 because I'm impatient:
| Comment by David Hows [ 11/Nov/15 ] |

I've done some further diving into this issue today and looked to get tcmalloc's CPU profiler to tell us where the time is going when performance dips. With profiling enabled, I found that the inflection point was lower than the 10K it had been previously. Attached are 3 PDF files generated from the profiling. Many of the hot paths seem to be related to tcmalloc, but I'm not sure whether that means the allocator is the bottleneck or simply reflects the allocator's overhead showing up under profiling. In any event, it appears from this output that there may be a more systemic issue. Has anyone tested to see if this same problem occurs under MMAP? I'll be testing to see if a different allocator has an impact shortly.
| Comment by Alexander Gorrod [ 11/Nov/15 ] |

Thanks for the update, david.hows. Could you adjust the startup of your script to match what Bruce's script does, which starts 10 new MongoDB connections every 100ms until the configured maximum is reached, and then take measurements once the system has reached a steady state? That should avoid any startup artifacts.
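To make the ramp concrete, a hypothetical sketch follows (in C++ for illustration only; the actual repro is the attached 10000men.js mongo shell script, and startConnection() here is a placeholder, not a real helper):

```cpp
// Illustrative ramp: start 10 new connections every 100ms until the configured
// maximum, then measure at steady state. startConnection() is a placeholder
// for opening a client connection and running its workload.
#include <chrono>
#include <thread>
#include <vector>

void startConnection(std::vector<std::thread>& conns) {
    conns.emplace_back([] {
        // ... open a MongoDB connection and run its insert loop here ...
    });
}

int main() {
    const int kMaxConnections = 10000;
    std::vector<std::thread> conns;
    conns.reserve(kMaxConnections);

    while (static_cast<int>(conns.size()) < kMaxConnections) {
        for (int i = 0; i < 10 && static_cast<int>(conns.size()) < kMaxConnections; ++i)
            startConnection(conns);                                       // 10 more connections...
        std::this_thread::sleep_for(std::chrono::milliseconds(100));      // ...every 100ms
    }
    // ... system is at the configured maximum; take steady-state measurements here ...
    for (auto& t : conns) t.join();
    return 0;
}
```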
| Comment by David Hows [ 11/Nov/15 ] |

Had a look to see whether this was purely an issue outside of MongoDB. Using a similar methodology to what Bruce uses here for the tests, I found that there isn't any noticeable decay in throughput as the number of threads increases, with one caveat: all execution MUST be evenly spread out over the tracking period. If I try to run 20,000 threads at 150,000 ops/sec per core over 8 cores, then I must start execution of all 20K threads at varied times over the full course of the 1-second tracking period; otherwise I get a big startup jam and large (and varied) spikes in system CPU due to scheduling issues.
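For reference, a minimal standalone sketch (not the actual test harness used here) of the staggered start described above, with thread start times spread evenly across a 1-second window; the thread count and window size are illustrative and may require raised thread/ulimit settings:

```cpp
// Spread thread start times evenly over a window instead of launching them all
// at once, avoiding the scheduler "startup jam" described in the comment.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int kThreads = 20000;                         // illustrative
    const auto kWindow = std::chrono::milliseconds(1000);
    const auto t0 = std::chrono::steady_clock::now();

    std::vector<std::thread> workers;
    workers.reserve(kThreads);
    for (int i = 0; i < kThreads; ++i) {
        // Each thread waits for its own offset within the window before working.
        const auto offset = kWindow * i / kThreads;
        workers.emplace_back([t0, offset] {
            std::this_thread::sleep_until(t0 + offset);
            // ... issue this thread's workload here ...
        });
    }
    for (auto& t : workers) t.join();
    std::printf("started %d threads over %lld ms\n",
                kThreads, static_cast<long long>(kWindow.count()));
    return 0;
}
```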
| Comment by Bruce Lucas (Inactive) [ 30/Oct/15 ] |

My tests were run on a 12-core (24-CPU) machine. As I mentioned in the opening comment, I measured the raw capacity of the system at low connection count and high op rate per connection (100-200 connections running flat out, CPU-limited I believe) to be 260k inserts/s, whereas at high connection count (and low op rate per connection) it topped out and began scaling negatively at less than half that, 100k inserts/s. I suggest you calibrate your system that way and make a similar comparison between total capacity at low and at high connection counts.

If you're seeing a big spike in system CPU, that sounds like a non-linearity that may well be related to the cause of this issue, and it also sounds like the kind of thing that causes negative scaling. My guess would be spin loops calling sched_yield, which consume lots of system CPU and can result in negative scaling because you get, in effect, a priority inversion: lots of threads calling sched_yield interfere with progress on the very thing they're waiting for.
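To make the sched_yield guess concrete, here is an illustrative sketch (not mongod code) of the pattern: many waiters spinning on a condition and yielding burn system CPU, and with enough of them they can starve the one thread that would satisfy the condition.

```cpp
// Illustrative yield-loop: waiters spin and call sched_yield() until a single
// producer sets a flag. Each yield is a system call, so with many waiters the
// system CPU spent yielding can crowd out the producer they depend on.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <sched.h>
#include <thread>
#include <vector>

std::atomic<bool> ready{false};

void spin_waiter() {
    while (!ready.load(std::memory_order_acquire)) {
        sched_yield();   // burns system CPU; scales badly with many waiters
    }
}

int main() {
    std::vector<std::thread> waiters;
    for (int i = 0; i < 64; ++i) waiters.emplace_back(spin_waiter);

    // The "producer" the waiters depend on; with enough yielding waiters it
    // can be starved of CPU, which is the priority-inversion effect described.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    ready.store(true, std::memory_order_release);

    for (auto& t : waiters) t.join();
    std::puts("done");
    return 0;
}
```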
| Comment by David Hows [ 30/Oct/15 ] |

Further on this one: ran again on 3.2 and again began seeing the degradation at about the 9500-connection mark. Looking back, the first "wobble" shows up at about 8851 connections, when there is a big spike in system CPU. The major drop-off occurs when my system CPU starts to be fully consumed, so I suspect that my results simply reflect my system capping out on available processing capacity. Obvious question: how large are the servers bruce.lucas and martin.bligh are working with? I may need to spin up a larger server in order to start diving properly.
| Comment by Daniel Pasette (Inactive) [ 30/Oct/15 ] |

We tested and knew this to be the curve for both 3.0 and 3.2.
| Comment by David Hows [ 29/Oct/15 ] |

bruce.lucas, I checked and can reproduce this (with the knee beginning between 9451 and 9551.5 connections) on MongoDB 3.2. Was this issue expected to affect only 3.0?
| Comment by David Hows [ 27/Oct/15 ] |

Ran this against 3.0.8 on my little Linux machine. I observed the same decay as Bruce, but it capped out around the 4.3K-connection range (between 4351 and 4451 connections, to be exact) under MongoDB 3.0.8-pre. Running with a recent compile of master, I see the decay start at the 10K range as described.
| Comment by Martin Bligh [ 15/Sep/15 ] |

Very steep knee ... might be this in the cursor cache? WiredTigerSession::releaseCursor
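As a purely hypothetical illustration of that suspicion (the names and logic below are stand-ins, not the actual WiredTigerSession::releaseCursor implementation): if returning a cursor takes a shared lock and does work proportional to the number of cached cursors, the per-release cost grows with connection count until throughput falls off a cliff.

```cpp
// Hypothetical cursor-cache release path: a single contended lock plus
// bookkeeping that scans the cache, so time spent per release grows with the
// number of idle cursors (and therefore with the number of connections).
#include <cstddef>
#include <cstdint>
#include <list>
#include <mutex>

struct WT_CURSOR_STUB {};  // stand-in for a WiredTiger cursor handle

class CursorCache {
public:
    void releaseCursor(uint64_t tableId, WT_CURSOR_STUB* cursor) {
        std::lock_guard<std::mutex> lk(_mutex);  // single lock: contended by all threads
        // Hypothetical bookkeeping: count cursors already cached for this table.
        // The scan is O(number of cached cursors), so the time spent while
        // holding the lock grows as more connections leave cursors behind.
        std::size_t sameTable = 0;
        for (const Entry& e : _cache) {
            if (e.tableId == tableId) ++sameTable;
        }
        if (sameTable < kMaxPerTable) {
            _cache.push_back({tableId, cursor});  // keep it for reuse
        }
        // else: a real implementation would close the cursor instead of caching it
    }

private:
    static constexpr std::size_t kMaxPerTable = 100;
    struct Entry { uint64_t tableId; WT_CURSOR_STUB* cursor; };
    std::mutex _mutex;
    std::list<Entry> _cache;
};

int main() {
    CursorCache cache;
    WT_CURSOR_STUB c;
    cache.releaseCursor(/*tableId=*/42, &c);
    return 0;
}
```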