[SERVER-20409] Negative scaling with more than 10K connections Created: 15/Sep/15  Updated: 07/Dec/16  Resolved: 26/Jan/16

Status: Closed
Project: Core Server
Component/s: Performance, WiredTiger
Affects Version/s: 3.0.7, 3.2.0-rc1
Fix Version/s: 3.2.3, 3.3.1

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Geert Bosch
Resolution: Done Votes: 0
Labels: code-only
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 10000men.js     PDF File after.pdf     PDF File early.pdf     PNG File image-6.png     PDF File near.pdf    
Issue Links:
Related
is related to SERVER-20303 Negative scaling at low thread count ... Closed
is related to SERVER-20185 Scaling issue at high connection coun... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Sprint: Integration F (02/01/16)
Participants:

 Description   

This is a follow-on to SERVER-20185; see that ticket for test details. In brief, each connection does 10 inserts per second, and throughput is measured as the number of connections is slowly ramped up. At high connection counts, negative scaling is observed. The same test at a low connection count (100-200 connections) but with no delay in the loop (i.e. each connection inserting flat out) yielded ~260k inserts/s, so the expected 150k inserts/s at 15k connections should be within the raw capacity of the system.

The storage engine was WiredTiger, so this could be SERVER-20303, although that ticket involves large documents (100 kB) and this one involves small documents (30 B). We can retest when SERVER-20303 is fixed.
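
For reference, a minimal single-threaded sketch of the workload shape described above (the attached 10000men.js is the authoritative script; this is only an approximation, and the host, namespace, and document used here are illustrative):

var conns = [];
var maxConns = 15000;
var doc = {a: "xxxxxxxxxx"};              // small (~30 B) document, as in the test

while (conns.length < maxConns) {
    for (var i = 0; i < 10; i++) {        // 10 new connections every ~100ms
        conns.push(new Mongo("localhost:27017").getDB("test"));
    }
    var start = new Date();
    // one small insert per connection per ~100ms pass, i.e. ~10 inserts/s each
    conns.forEach(function(d) { d.getCollection("c").insert(doc); });
    var elapsed = new Date() - start;
    print(conns.length + " connections, pass took " + elapsed + " ms");
    sleep(Math.max(100 - elapsed, 0));
}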



 Comments   
Comment by Githook User [ 01/Feb/16 ]

Author: Geert Bosch (GeertBosch) <geert@mongodb.com>

Message: SERVER-20409: Get ticket before getting WT session

(cherry picked from commit d79994be3744cb8ae6fa03d775feae178ac69dbc)
Branch: v3.2
https://github.com/mongodb/mongo/commit/9802549c17701a160c7ca29870bbc057d07ed1ed

Comment by Githook User [ 26/Jan/16 ]

Author: Geert Bosch (GeertBosch) <geert@mongodb.com>

Message: SERVER-20409: Get ticket before getting WT session
Branch: master
https://github.com/mongodb/mongo/commit/d79994be3744cb8ae6fa03d775feae178ac69dbc

Comment by Geert Bosch [ 25/Jan/16 ]

I made the trivial change to adjust the order of getting tickets and sessions:

diff --git a/src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp b/src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp
index 316b482..c7fb320 100644
--- a/src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp
+++ b/src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp
@@ -176,8 +176,6 @@ void WiredTigerRecoveryUnit::assertInActiveTxn() const {
 }
 
 WiredTigerSession* WiredTigerRecoveryUnit::getSession(OperationContext* opCtx) {
-    _ensureSession();
-
     if (!_active) {
         _txnOpen(opCtx);
     }
@@ -341,6 +339,7 @@ void WiredTigerRecoveryUnit::_getTicket(OperationContext* opCtx) {
 void WiredTigerRecoveryUnit::_txnOpen(OperationContext* opCtx) {
     invariant(!_active);
     _getTicket(opCtx);
+    _ensureSession();
 
     WT_SESSION* s = _session->getSession();
 

This does indeed seem to help (red is after the patch, blue is before):

Comment by Bruce Lucas (Inactive) [ 14/Nov/15 ]

That could be explained if the futex had a lot of waiters and it were trying to wake them all. However in the context where it's called (sem_post), if I'm interpreting it correctly, I would have thought it would only wake one. On the other hand I think I've seen similar cases where an unlocking type of operation (don't recall the details) seemed unexpectedly slow when there were a lot of waiters.

Comment by Martin Bligh [ 13/Nov/15 ]

Another intriguing perf trace (attempting 15K threads):

-  80.69%  mongod  [kernel.kallsyms]  [k] _raw_spin_lock                                       
     _raw_spin_lock                                                                            
     futex_wake                                                                                
     do_futex                                                                                  
     sys_futex                                                                                 
     system_call_fastpath                                                                      
     sem_post                                                                                  
     mongo::WiredTigerRecoveryUnit::_txnClose(bool)                                            
     mongo::WiredTigerRecoveryUnit::_commit()                                                  
     mongo::insertMultiVector(mongo::OperationContext*, mongo::OldClientContext&, bool, char co
     mongo::insertMulti(mongo::OperationContext*, mongo::OldClientContext&, bool, char const*, 
     mongo::_receivedInsert(mongo::OperationContext*, mongo::NamespaceString const&, char const
     mongo::receivedInsert(mongo::OperationContext*, mongo::NamespaceString const&, mongo::Mess
     mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mon
     mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*)          
     mongo::PortMessageServer::handleIncomingMsg(void*)                                        
     start_thread                                                                              
+   5.07%    perf  [kernel.kallsyms]  [k] generic_exec_single                                  
+   4.58%  mongod  [kernel.kallsyms]  [k] update_stats_wait_end                                
+   3.74%  mongod  [kernel.kallsyms]  [k] rb_erase                                             
+   3.58%  mongod  [kernel.kallsyms]  [k] tcp_v4_do_rcv                                        
+   2.33%  mongod  [kernel.kallsyms]  [k] handle_edge_irq                                      
+   0.00%  mongod  [kernel.kallsyms]  [k] native_write_msr_safe                                
+   0.00%    perf  [kernel.kallsyms]  [k] native_write_msr_safe     

Comment by David Hows [ 12/Nov/15 ]

Ran with the standard glibc allocator, tcmalloc, and jemalloc today.

They all seemed to "elbow" at around the 8500-connection mark using the attached 10000men.js script, which rules out my hypothesis from yesterday.

Comment by Martin Bligh [ 11/Nov/15 ]

Found a way to get perf to work with 10K threads:

perf record -F 1 -a  --call-graph fp

-  20.85%   mongod  mongod               [.] __wt_txn_get_snapshot                                                                             
     __wt_txn_get_snapshot                                                                                                                     
     __session_begin_transaction                                                                                                               
     mongo::WiredTigerRecoveryUnit::_txnOpen(mongo::OperationContext*)                                                                         
     mongo::WiredTigerCursor::WiredTigerCursor(std::string const&, unsigned long, bool, mongo::OperationContext*)                              
     mongo::WiredTigerRecordStore::insertRecords(mongo::OperationContext*, std::vector<mongo::Record, std::allocator<mongo::Record> >*, bool)  
     mongo::Collection::_insertDocuments(mongo::OperationContext*, __gnu_cxx::__normal_iterator<mongo::BSONObj const*, std::vector<mongo::BSONO
     mongo::Collection::insertDocuments(mongo::OperationContext*, __gnu_cxx::__normal_iterator<mongo::BSONObj const*, std::vector<mongo::BSONOb
     mongo::insertMultiVector(mongo::OperationContext*, mongo::OldClientContext&, bool, char const*, mongo::CurOp&, __gnu_cxx::__normal_iterato
     mongo::insertMulti(mongo::OperationContext*, mongo::OldClientContext&, bool, char const*, std::vector<mongo::BSONObj, std::allocator<mongo
     mongo::_receivedInsert(mongo::OperationContext*, mongo::NamespaceString const&, char const*, std::vector<mongo::BSONObj, std::allocator<mo
     mongo::receivedInsert(mongo::OperationContext*, mongo::NamespaceString const&, mongo::Message&, mongo::CurOp&)                            
     mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&)                         
     mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*)                                                          
     mongo::PortMessageServer::handleIncomingMsg(void*)                                                                                        
     start_thread                                                                                                                              
+   6.86%  swapper  [kernel.kallsyms]    [k] intel_idle                                                                                        
+   6.54%  swapper  [kernel.kallsyms]    [k] task_waking_fair                                                                                  
+   6.06%   mongod  libstdc++.so.6.0.19  [.] std::string::_M_replace_safe(unsigned long, unsigned long, char const*, unsigned long)            
-   5.24%   mongod  mongod               [.] (anonymous namespace)::do_free_helper(void*, void (*)(void*), tcmalloc::ThreadCache*, bool)       
     (anonymous namespace)::do_free_helper(void*, void (*)(void*), tcmalloc::ThreadCache*, bool)                                               
     operator delete(void*)                                                                                                                    
     mongo::WiredTigerSession::getCursor(std::string const&, unsigned long, bool)                                                              
     mongo::WiredTigerCursor::WiredTigerCursor(std::string const&, unsigned long, bool, mongo::OperationContext*)                              
     mongo::WiredTigerIndex::insert(mongo::OperationContext*, mongo::BSONObj const&, mongo::RecordId const&, bool)                             
     mongo::IndexAccessMethod::insert(mongo::OperationContext*, mongo::BSONObj const&, mongo::RecordId const&, mongo::InsertDeleteOptions const
     mongo::IndexCatalog::_indexFilteredRecords(mongo::OperationContext*, mongo::IndexCatalogEntry*, std::vector<mongo::BsonRecord, std::alloca
     mongo::IndexCatalog::_indexRecords(mongo::OperationContext*, mongo::IndexCatalogEntry*, std::vector<mongo::BsonRecord, std::allocator<mong
     mongo::IndexCatalog::indexRecords(mongo::OperationContext*, std::vector<mongo::BsonRecord, std::allocator<mongo::BsonRecord> > const&)    
     mongo::Collection::_insertDocuments(mongo::OperationContext*, __gnu_cxx::__normal_iterator<mongo::BSONObj const*, std::vector<mongo::BSONO
     mongo::Collection::insertDocuments(mongo::OperationContext*, __gnu_cxx::__normal_iterator<mongo::BSONObj const*, std::vector<mongo::BSONOb
     mongo::insertMultiVector(mongo::OperationContext*, mongo::OldClientContext&, bool, char const*, mongo::CurOp&, __gnu_cxx::__normal_iterato
     mongo::insertMulti(mongo::OperationContext*, mongo::OldClientContext&, bool, char const*, std::vector<mongo::BSONObj, std::allocator<mongo
     mongo::_receivedInsert(mongo::OperationContext*, mongo::NamespaceString const&, char const*, std::vector<mongo::BSONObj, std::allocator<mo
     mongo::receivedInsert(mongo::OperationContext*, mongo::NamespaceString const&, mongo::Message&, mongo::CurOp&)                            
     mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&)                         
     mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*)                                                          
     mongo::PortMessageServer::handleIncomingMsg(void*)                                                                                        
     start_thread                                                                        

Comment by Martin Bligh [ 11/Nov/15 ]

Attaching a slightly tweaked version of Bruce's repro (10000men.js) that runs a bit faster and also prints stats after getting to its max value. On my box, taking it to 14K and letting it sit there seems to make it teeter on the brink of death.

Comment by Martin Bligh [ 11/Nov/15 ]

Lots of threads seem to be waiting for a ticket (I stopped the whole thing with gdb at this point).
Tried doubling the number of tickets, to no avail.

Thread 10004 (Thread 0x7f6a73160700 (LWP 17674)):
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x0000000001316b82 in mongo::TicketHolder::waitForTicket (
    this=0x2272b78 <mongo::(anonymous namespace)::openWriteTransaction>)
    at src/mongo/util/concurrency/ticketholder.cpp:75
#2  0x00000000010fba87 in _getTicket (this=0x470e130, opCtx=<optimized out>)
    at src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp:327
#3  mongo::WiredTigerRecoveryUnit::beginUnitOfWork (this=0x470e130, opCtx=<optimized out>)
    at src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp:131
#4  0x0000000000cccc82 in WriteUnitOfWork (txn=<optimized out>, this=<optimized out>, 
    txn=<optimized out>) at src/mongo/db/operation_context.h:232
#5  mongo::insertMultiVector (txn=0x49c2060, ctx=..., keepGoing=false, ns=0x49b3814 "test.c", 
    op=..., begin=..., end=...) at src/mongo/db/instance.cpp:986
#6  0x0000000000ccd3ba in mongo::insertMulti (txn=0x49c2060, ctx=..., keepGoing=false, 
    ns=0x49b3814 "test.c", docs=..., op=...) at src/mongo/db/instance.cpp:1036
#7  0x0000000000ccd490 in mongo::_receivedInsert (txn=0x49c2060, nsString=..., 
    ns=0x49b3814 "test.c", docs=..., keepGoing=false, op=..., checkCollection=<optimized out>)
    at src/mongo/db/instance.cpp:1131
#8  0x0000000000cc7592 in mongo::receivedInsert (txn=0x49c2060, nsString=..., m=..., op=...)
    at src/mongo/db/instance.cpp:1175
#9  0x0000000000cc2112 in mongo::assembleResponse (txn=0x49c2060, m=..., dbresponse=..., 
    remote=...) at src/mongo/db/instance.cpp:569
#10 0x0000000000964dfa in mongo::MyMessageHandler::process (this=<optimized out>, m=..., 
    port=0x49e0460) at src/mongo/db/db.cpp:170
#11 0x000000000132beac in mongo::PortMessageServer::handleIncomingMsg (arg=0x49e0460)
    at src/mongo/util/net/message_server_port.cpp:229
#12 0x00007f6a7b04a182 in start_thread (arg=0x7f6a73160700) at pthread_create.c:312
#13 0x00007f6a7ad7747d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
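
For reference, the tickets Martin mentions are the WiredTiger concurrency limits, which on 3.0/3.2 default to 128 read and 128 write tickets and can be changed at runtime from the shell. The snippet below shows the usual way of "doubling the number of tickets"; it is shown only as an illustration and is not necessarily how it was done in this test:

// Double the WiredTiger write tickets (default 128) and check the current values.
db.adminCommand({setParameter: 1, wiredTigerConcurrentWriteTransactions: 256});
db.adminCommand({getParameter: 1,
                 wiredTigerConcurrentWriteTransactions: 1,
                 wiredTigerConcurrentReadTransactions: 1});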

Comment by Martin Bligh [ 11/Nov/15 ]

Seems unstable. Sometimes it drops off before 10K:

connections    inserts/s
9351    93766
9451    18130
9551    2855

Spins like a politician:

 92.03%  [kernel]                       [k] _raw_spin_lock
  4.10%  [kernel]                       [k] futex_wake
  0.70%  libxul.so                      [.] 0x00000000009b6fcc

Comment by Bruce Lucas (Inactive) [ 11/Nov/15 ]

Regarding the question about mmap: because of the collection-level locking in mmap, the test as written will bottleneck on collection access at a very small number of threads. I tried a couple of things:

  • a separate collection for each group of threads (a rough sketch of this variant follows below). Bottlenecked at something very far short of 10k threads, both for mmap and wt. Suspect that is unrelated to this issue (but probably bears investigation as well).
  • changed writes to reads to avoid collection-level locking. Did not reproduce the problem seen here. There was some erratic behavior at higher thread counts, both for wt and mmap. Suspect again that is a different effect.
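
A rough sketch of the "separate collection for each group of threads" variant mentioned in the first bullet (group size and naming here are illustrative, not taken from the actual test):

// Each group of connections writes to its own collection, so that mmapv1's
// collection-level lock is spread over many collections instead of one.
var groupSize = 100;
var testDB = new Mongo("localhost:27017").getDB("test");
function collForConn(connId) {
    return testDB.getCollection("c" + Math.floor(connId / groupSize));
}
collForConn(12345).insert({a: "x"});   // connection 12345 writes to test.c123
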
Comment by Martin Bligh [ 11/Nov/15 ]

Perf callgraph doesn't work with 10,000 threads, but perf top does.

With --setParameter enableReplSnapshotThread=false I get to about 13K threads, then perf top shows us taking 13% of time in __wt_txn_get_snapshot.

Without that, it gets to about 14K (might just be noise) then goes nuts:

Samples: 20M of event 'cycles', Event count (approx.): 850942834350                                                                             
 24.22%  mongod                              [.] __wt_txn_get_snapshot
  6.81%  [kernel]                            [k] _raw_spin_lock

The only changes I made to Bruce's repro script were to set it to 15K and to change this from 100 to 50 because I'm impatient:

        // 10 new connections every 100ms
        sleep(50)

Comment by David Hows [ 11/Nov/15 ]

I've done some further digging into this issue today and used tcmalloc's CPU profiler to tell us where the CPU time goes when the throughput dips. With profiling enabled I found that the inflection point was lower than the 10K it had been previously.

Attached are three PDF files generated from the profiling.
They show where the profiler found CPU being consumed inside MongoDB.

Many of these seem to be related to tcmalloc, but I'm not sure whether this means the allocator is the bottleneck or is simply the allocator's overhead showing up under profiling.

In any event, it appears from this output that there may be a more systemic issue. Has anyone tested whether this same problem occurs under mmap?

I'll be testing shortly to see if a different allocator has an impact.

Comment by Alexander Gorrod [ 11/Nov/15 ]

Thanks for the update, david.hows. Could you adjust the startup of your script to match what Bruce's script does, which is to start 10 new MongoDB connections every 100ms until the configured maximum is reached, and then take measurements once the system has reached a steady state? That should avoid any startup artifacts.

Comment by David Hows [ 11/Nov/15 ]

Had a look to see if this was an issue purely external to MongoDB.

Using a methodology similar to what Bruce uses here for the tests, I found that there isn't any noticeable decay in throughput as the number of threads increases, with one caveat: all execution MUST be evenly spread out over the tracking period.

If I try to run 20000 threads at 150000 ops/sec per core over 8 cores, then I must start execution of all 20K threads at varied times over the full course of the 1-second tracking period; otherwise I get a big startup jam and large (and varied) spikes in system CPU due to scheduling issues.

Comment by Bruce Lucas (Inactive) [ 30/Oct/15 ]

My tests were run on a 12-core (24-CPU) machine. As I mentioned in the opening comment, I measured the raw capacity of the system at low connection count and high op rate per connection (100-200 connections running flat out, CPU-limited I believe) to be 260k inserts/s, whereas at high connection count (and low op rate per connection) it topped out and began scaling negatively at less than half that, 100k inserts/s. I suggest you calibrate your system that way and make a similar comparison between total capacity at low and at high connection counts.

If you're seeing a big spike in system CPU, that sounds like a non-linearity that may well be related to the cause of this issue, and also like the kind of thing that causes negative scaling. My guess would be spin loops calling sched_yield, which consume lots of system CPU and can result in negative scaling because you get, in effect, a priority inversion: lots of threads calling sched_yield interfere with progress on the very thing they're waiting for.
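
A rough sketch of that calibration using the shell's benchRun helper (thread count, duration, host, and namespace are illustrative): a small number of connections inserting flat out, to get a raw-capacity number to compare against the throughput seen at high connection counts.

var res = benchRun({
    ops: [{op: "insert", ns: "test.c", doc: {a: "xxxxxxxxxx"}}],
    parallel: 128,          // "100-200 connections running flat out"
    seconds: 30,
    host: "localhost:27017"
});
printjson(res);             // compare the reported insert rate to the ramped test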

Comment by David Hows [ 30/Oct/15 ]

Further on this one.

Ran again on 3.2 and again began seeing the degradation at about the 9500-connection mark.

Looking back, the first "wobble" shows up at about 8851 connections, when there is a big spike in system CPU. The major drop-off occurs when my system CPU starts to be fully consumed.

I suspect that my results simply reflect my system capping out on available processing capacity. Obvious question: how large are the servers bruce.lucas and martin.bligh are working with? I may need to spin up a larger server in order to start digging properly.

Comment by Daniel Pasette (Inactive) [ 30/Oct/15 ]

We tested and knew this to be the curve for both 3.0 and 3.2.

Comment by David Hows [ 29/Oct/15 ]

bruce.lucas, I checked and can reproduce this (with the knee beginning between 9451 and 9551.5 connections) on MongoDB 3.2.

Was this issue expected to affect only 3.0?

Comment by David Hows [ 27/Oct/15 ]

Ran this against 3.0.8 on my little Linux machine. I observed the same decay as Bruce, but it capped out around the 4.3K connection range (between 4351 and 4451 connections, to be exact) under MongoDB 3.0.8-pre.

Running with a recent compile of master, I see the decay start at the 10K range as described.

Comment by Martin Bligh [ 15/Sep/15 ]

Very steep knee ... might it be this in the cursor cache?

WiredTigerSession::releaseCursor

    // Close cached cursors whose generation is more than 10000 behind the
    // current generation, i.e. the least-recently-used cached cursors.
    while (_cursorGen - _cursors.back()._gen > 10000) {
        cursor = _cursors.back()._cursor;
        _cursors.pop_back();
        _cursorsCached--;
        invariantWTOK(cursor->close(cursor));
    }
