- single collection, 100k documents (fits in memory)
- 12 CPUs (6 cores)
- workload is n connections, each querying a random single document by _id in a loop (see the sketch after this list)
- measured raw flat-out maximum capacity using 150 connections each issuing queries as fast as possible; the numbers are similar for WT (174k queries/s) and mmapv1 (204k queries/s)
- then simulated a customer app by introducing a delay in the loop so that each connection executes 10 queries/s, and ramped the number of connections up to 10k, for an expected throughput of 10 queries/connection/s * 10k connections = 100k queries/s. This is well below (about half) the measured maximum raw capacity for both WT and mmapv1, so we should expect to achieve close to 100k queries/s at 10k connections
- mmapv1 does achieve close to that (75k queries/s), but WT only reaches about 25k queries/s at 10k connections, and its behavior is erratic
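Roughly, each simulated connection behaves like the following sketch (for illustration only; the actual test drives this loop through benchRun as in the repro code below, and clientLoop is a hypothetical name):

    // one simulated client connection: query a random document by _id,
    // then sleep ~100 ms, giving roughly 10 queries/s per connection
    function clientLoop(seconds) {
        var end = new Date().getTime() + seconds * 1000
        while (new Date().getTime() < end) {
            var id = Math.floor(Math.random() * 100000)   // random _id (range is illustrative)
            db.c.find({_id: id}).itcount()                // single-document query by _id
            sleep(100)                                    // ~10 queries/s
        }
    }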
mmapv1

- max raw capacity is 204k queries/s (as described above, this is with 150 connections each issuing queries as fast as possible)
- as connections are ramped up to 10k, this time with each connection issuing only 10 queries/s, throughput is excellent up to about 6k connections, with some mild falloff above that
- at 10k connections we get about 75k queries/s (estimated by fitting the blue quadratic trendline), not too far below the expected 100k queries/s
WiredTiger

- max raw capacity is similar to mmapv1 at 174k queries/s (as described above, this is with 150 connections each issuing queries as fast as possible)
- but as connections are ramped up to 10k, this time with each connection issuing only 10 queries/s, behavior becomes erratic above about 3k connections
- at 10k connections we get only about 25k queries/s (estimated by fitting the blue quadratic trendline), far below the expected 100k queries/s
Repro code:
function repro_setup() {
    var x = []                              // built but not used below
    for (var i = 0; i < 100; i++)
        x.push(i)
    // insert 100k empty documents into test.c in unordered bulk batches of 10k
    var count = 100000
    var every = 10000
    for (var i = 0; i < count; ) {
        var bulk = db.c.initializeUnorderedBulkOp();
        for (var j = 0; j < every; j++, i++)
            bulk.insert({})
        bulk.execute();
        print(i)                            // progress
    }
}
// current number of open connections on the server
function conns() {
    return db.serverStatus().connections.current
}
// ramp up to threads_query additional connections, 10 benchRun worker threads at a time;
// each worker queries a random _id with a ~100 ms delay between ops, i.e. ~10 queries/s per connection
function repro(threads_query) {
    var start_conns = conns()
    while (conns() < start_conns + threads_query) {
        var ops_query = [{
            op: "query",
            ns: "test.c",
            query: {_id: {"#RAND_INT": [0, 10000]}},       // single document by random _id
            delay: NumberInt(100 + Math.random()*10 - 5)   // ~100 ms between ops per worker
        }]
        var res = benchStart({
            ops: ops_query,
            seconds: seconds,   // run length; assumed to be set globally before calling repro()
            parallel: 10,       // 10 more worker threads (connections) per iteration
        })
        sleep(100)
    }
}
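Assumed usage (not spelled out above): load these functions into a mongo shell connected to the test database, build the collection once, set a run length, then ramp up the desired number of querying connections:

    seconds = 3600      // global read by repro(); must cover the whole ramp
    repro_setup()
    repro(10000)        // ramp to ~10k querying connections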
Author: Martin Bligh (martinbligh) <mbligh@mongodb.com>
Message: SERVER-20091: turn network counters into atomics
(cherry picked from commit 4ac0bfb55d9cccbc6784ded607eae14312ec9bcc)
Branch: v3.0
https://github.com/mongodb/mongo/commit/9221fcee30937ae930464d55ce0e275dae6d1795
Author: Martin Bligh (martinbligh) <mbligh@mongodb.com>
Message: SERVER-20091: turn network counters into atomics
Branch: master
https://github.com/mongodb/mongo/commit/4ac0bfb55d9cccbc6784ded607eae14312ec9bcc
Found another spinlock that seems to be responsible for the remaining scalability issue and erratic behavior seen above, this one relating to network counter stats:
3325 0.0;1;;clone:111;start_thread:312;mongo::PortMessageServer::handleIncomingMsg:232;mongo::NetworkCounter::hit:152;lock:76;mongo::SpinLock::_lk:95;nanosleep:81
340 0.0;1;;clone:111;start_thread:312;mongo::PortMessageServer::handleIncomingMsg:232;mongo::NetworkCounter::hit:152;lock:76;mongo::SpinLock::_lk:87;sched_yield:81
226 0.0;1;;clone:111;start_thread:312;mongo::PortMessageServer::handleIncomingMsg:219;mongo::MessagingPort::recv:183;mongo::Socket::recv:762;mongo::Socket::unsafe_recv:772;mongo::Socket::_recv:784;recv:44;__libc_recv:33
In this single sample about 3k threads are sleeping in the spinlock guarding the network counters, while 340 threads are doing a busy wait calling sched_yield, creating severe CPU contention.
POC patch attached that eliminates the lock and uses atomic ints for the counters instead. As the following measurements show, this, together with the fix for the session cache spinlock, seems to completely fix the scalability issues and essentially eliminates the erratic behavior. The downside of this approach is that there may be skew between the network counters (bytes in, bytes out, number of requests), but I think this is acceptable since the primary use of these counters is to compute rates over periods of typically 1s or more, so precise alignment of the counters at the instruction level is not important.
Remaining question is why this appears to impact WT more than mmap. I suspect that is because this is a CPU contention issue, and WT in general uses more CPU than mmap for the same task.
Red samples are with the previous fix for the session cache spinlock only; blue samples add NetworkCounters.patch.

Author: Martin Bligh (martinbligh) <mbligh@mongodb.com>
Message: SERVER-20091: Use mutex instead of spinlock to protect session cache
Branch: v3.0
https://github.com/mongodb/mongo/commit/604c22e9b45f4e08c042fb1857bb34ea20de0fb1
Seems like this should be safe to backport?
Author: Martin Bligh (martinbligh) <mbligh@mongodb.com>
Message: SERVER-20091: use mutex to protect WT session cache
Branch: master
https://github.com/mongodb/mongo/commit/c22b157cfcdc2f0a9c3018f656ade45cca8697d7
Replacing the previous analysis, which used numbers reported by mongostat, because mongostat appears to sometimes produce unreliable data: for some reason it will report the same stats for several lines in a row, and I suspect this underreports periods of low throughput. The new analysis is based on my own monitoring code: as connections are ramped up, every 100 connections note the delta time and delta query count since the last sample, and report connections and queries/s based on that (a sketch follows).
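A rough sketch of that monitoring loop (my reconstruction, not the actual code; assumes opcounters.query captures the test traffic):

    // sample serverStatus once a second; each time the connection count has grown by
    // ~100 since the last report, print connections and queries/s computed from the
    // opcounters.query and wall-clock deltas
    function monitor() {
        var last = db.serverStatus(), lastTime = new Date().getTime()
        while (true) {
            sleep(1000)
            var now = db.serverStatus(), nowTime = new Date().getTime()
            if (now.connections.current - last.connections.current >= 100) {
                var qps = (now.opcounters.query - last.opcounters.query) / ((nowTime - lastTime) / 1000)
                print(now.connections.current + " connections, " + Math.round(qps) + " queries/s")
                last = now
                lastTime = nowTime
            }
        }
    }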
The net is that the mutex still substantially improves the situation, but does not quite achieve parity with mmapv1 at high connection counts. I suspect we are still suffering some contention on the session cache. The following graph shows sampled connection counts and query rates based on the above methodology; the lines are quadratic trend lines estimating throughput as a function of connection count.

(To reduce potential confusion, I am deleting the comment about an experiment that accidentally used the wrong primitive, shared_mutex instead of mutex; the results were not good and are not relevant to this ticket, although perhaps interesting - see the attached mutex.png.)
Setting slowms to a large value (roughly as sketched after the list below) to avoid the observed contention on the mongod log improves things for both mmapv1 and WT:

- mmapv1 (blue dots and blue trend line) is now achieving nearly perfect speedup to 10k connections (see blue quadratic trendline), with only some scattered periods of low performance (blue dots).
- WT is also improved, but still falls substantially short of perfect speedup (red quadratic trendline) and exhibits very substantial variability (red dots). Presumably this is the other bottleneck identified above, the WT session cache.
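For example, the threshold can be raised from the shell roughly as follows (a sketch; the value just needs to exceed any observed query latency, and slowms can equivalently be set at mongod startup):

    // raise the slow-operation logging threshold to 10s so slow-query
    // logging no longer contends on the mongod log during the test
    db.setProfilingLevel(0, 10000)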
Main thing I can see for context switches is this:
- 21.62% mongod [kernel.kallsyms] [k] __schedule
   - __schedule
      - schedule
         - 87.50% schedule_timeout
              sk_wait_data
              tcp_recvmsg
              inet_recvmsg
              sock_recvmsg
              SYSC_recvfrom
              sys_recvfrom
              system_call_fastpath
              __libc_recv
         - 12.50% futex_wait_queue_me
              futex_wait
              do_futex
              sys_futex
              system_call_fastpath
              - __lll_lock_wait
                 - 50.00% mongo::runQuery(mongo::OperationContext*, mongo::QueryMessage&, mongo::NamespaceString const&, mongo::Message&)
                      mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&)
                      mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*)
                      mongo::PortMessageServer::handleIncomingMsg(void*)
                      start_thread
                 - 50.00% mongo::AutoGetCollectionForRead::~AutoGetCollectionForRead()
                      mongo::runQuery(mongo::OperationContext*, mongo::QueryMessage&, mongo::NamespaceString const&, mongo::Message&)
                      mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&)
                      mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*)
                      mongo::PortMessageServer::handleIncomingMsg(void*)
                      start_thread
Caught it in the low-performing state once with gdb. The following were the top 4 stacks with counts (all others were well below these numbers):
7840 clone:111;start_thread:312;mongo::PortMessageServer::handleIncomingMsg:219;mongo::MessagingPort::recv:183;mongo::Socket::recv:762;mongo::Socket::unsafe_recv:772;mongo::Socket::_recv:784;recv:44;__libc_recv:33
470 clone:111;start_thread:312;mongo::PortMessageServer::handleIncomingMsg:230;mongo::MyMessageHandler::process:205;mongo::assembleResponse:467;mongo::logger::LogstreamBuilder::~LogstreamBuilder:120;mongo::logger::LogDomain<...>::append:59;mongo::logger::RotatableFileAppender<...>::append:59;mongo::logger::RotatableFileWriter::Use::Use:237;unique_lock:290;lock:412;lock:59;__GI___pthread_mutex_lock:79;_L_lock_909;__lll_lock_wait:135
204 clone:111;start_thread:312;mongo::PortMessageServer::handleIncomingMsg:230;mongo::MyMessageHandler::process:205;mongo::assembleResponse:385;receivedQuery:210;mongo::runQuery:746;mongo::PlanExecutor::getNext:292;mongo::PlanExecutor::getNextSnapshotted:333;mongo::IDHackStage::work:128;mongo::BtreeBasedAccessMethod::findSingle:201;mongo::WiredTigerIndexUnique::newCursor:1001;WiredTigerIndexUniqueCursor:884;WiredTigerIndexCursorBase:596;mongo::WiredTigerCursor::WiredTigerCursor;mongo::WiredTigerRecoveryUnit::getSession:199;mongo::WiredTigerSessionCache::getSession:168;lock_guard:257;lock:76;mongo::SpinLock::_lk:87;sched_yield:81
157 clone:111;start_thread:312;mongo::PortMessageServer::handleIncomingMsg:230;mongo::MyMessageHandler::process:205;mongo::assembleResponse:385;receivedQuery:210;mongo::runQuery:746;mongo::PlanExecutor::getNext:292;mongo::PlanExecutor::getNextSnapshotted:333;mongo::IDHackStage::work:128;mongo::BtreeBasedAccessMethod::findSingle:201;mongo::WiredTigerIndexUnique::newCursor:1001;WiredTigerIndexUniqueCursor:884;WiredTigerIndexCursorBase:596;mongo::WiredTigerCursor::WiredTigerCursor;mongo::WiredTigerRecoveryUnit::getSession:199;mongo::WiredTigerSessionCache::getSession:168;lock_guard:257;lock:76;mongo::SpinLock::_lk:95;nanosleep:81
- most threads are waiting for work in recv; this is expected
- a large number of threads are waiting for access to the log, presumably to log slow queries. This shouldn't differ between mmapv1 and WT, although it may exacerbate the issue and may account for some of the falloff in mmapv1 at high thread counts. Will try a larger slowms to eliminate this.
- a total of 361 threads are serializing on access to the WT session cache (the last two stacks above differ only in where exactly they are waiting). This seems a likely candidate for our culprit. It appears we're using a spinlock, which can perform badly under heavy contention, which is certainly the case here.
Ran several variations to look for factors that might make a difference, averaging throughput over longer periods of time to reduce the (considerable) erratic variability. Variations (read-transaction limit set roughly as sketched after this list):
- 3.1.7
- WT with wiredTigerConcurrentReadTransactions=infinity
- WT with wiredTigerConcurrentReadTransactions=32 (vs default 128)
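For reference, a sketch of how the read-transaction limit can be varied at runtime (my assumption; it can also be passed via --setParameter at startup, and the "infinity" case effectively removes the cap):

    // cap WT concurrent read transactions (read tickets) at 32 instead of the default 128
    db.adminCommand({setParameter: 1, wiredTigerConcurrentReadTransactions: 32})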

No dramatic improvements clearly larger than the variability, but
- 3.1.7 might be a little better
- setting wiredTigerConcurrentReadTransactions to infinity might be a little worse
Next step is to do some profiling.
Among the issues lumped into "erratic behavior" above are connections failing, serverStatus taking a long time to return, and throughput swinging wildly between a very low value and the expected value. You can see the last of these in the graphs, where the horizontal axis represents not only the number of connections but also time, because the graph was constructed by gradually increasing the number of connections while measuring both the connection count and second-by-second throughput with mongostat.