[SERVER-31694] 17% throughput regression in insert workload Created: 24/Oct/17  Updated: 06/Dec/22  Resolved: 28/Oct/22

Status: Closed
Project: Core Server
Component/s: Storage, Write Ops
Affects Version/s: 3.6.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Backlog - Replication Team
Resolution: Incomplete Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File ensure-session.patch     File pmp.out.3513     File pmp.out.359     File pmp.patch.sesscache     PNG File run2.png    
Issue Links:
Depends
depends on SERVER-31902 Reduce contention on ShardingState::_... Closed
depends on SERVER-31879 remove extraneous synchronization in ... Closed
depends on SERVER-31882 For vectored inserts, start transacti... Closed
depends on SERVER-31899 Improve getTerm() performance Closed
depends on SERVER-31912 waitUntilDurable() should use its own... Closed
Related
related to SERVER-34604 low performance under 3.6.4 with wire... Closed
related to SERVER-40250 High contention for ReplicationCoordi... Closed
related to SERVER-59776 50% regression in single multi-update Closed
related to SERVER-34562 I use YCSB testing comparise mongodb3... Closed
Assigned Teams:
Replication
Sprint: Storage 2017-11-13, Storage 2017-12-04, Storage 2018-01-29, Repl 2018-02-26, Repl 2018-03-12, Repl 2018-03-26
Participants:
Case:
Linked BF Score: 0

 Description   

Simple insert workload:

    res = benchRun({
        ops: [
            {op: "insert", doc: {},  ns: "test.c"},
        ],
        seconds: 45,
        parallel: 15
    })
    printjson(res)

standalone shows ~11% throughput reduction

3.4.9:     "totalOps/s" : 312257.31243007537,
3.6.0-rc0  "totalOps/s" : 277214.2306740861,

1-node replica set shows ~50% throughput reduction

3.4.9:     "totalOps/s" : 181808.21357151057,
3.6.0-rc0  "totalOps/s" : 91202.31599375796,

Metrics for the 1-node replica set case suggest the issue may be related to a large number of write-lock calls, about 3 per insert.
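
For reference, a single-node replica set for a quick repro of the numbers above can be stood up from the mongo shell roughly as follows (an illustrative sketch; the exact setup used for these runs is not recorded here):

    // Assumes mongod was started with a replica set name, e.g.:
    //   mongod --replSet rs0 --dbpath /data/rs0
    rs.initiate()                  // default single-member configuration on this host
    rs.status().members.length    // expect 1 once the node reports PRIMARY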



 Comments   
Comment by Judah Schvimer [ 28/Oct/22 ]

This regression is between two EOL versions of MongoDB, and the Server has diverged significantly since this occurred. Bulk insert performance continues to be something we track and aim to improve over time. Future performance improvements will be made using new analyses on active versions of the Server. Closing this as "Incomplete".

Comment by Andy Schwerin [ 20/Mar/19 ]

Great analysis, bartle. setMyLastAppliedOpTimeForward, and more generally the replication coordinator mutex, are known contention points in the system today. That said, I don't see a ticket specifically tracking it, and I think this ticket actually covers a separate case involving w:1 bulk insert performance. I have filed SERVER-40250 to track the issue you report in your comment above.

CC alyson.cabral and sheeri.cabral

Comment by David Bartley [ 20/Mar/19 ]

Are there plans to improve performance of setMyLastAppliedOpTimeForward?  On a write-majority, insert-heavy workflow we basically see single-core contention on setMyLastAppliedOpTimeForward (based on a CPU profile).  That particular function takes an exclusive mutex, so it's unsurprising that if you push enough write-majority writes through you'd contend on a single core (in practice we're hitting a bottleneck of 12k wps, on a 16-core machine, with ~7% CPU usage).

Ultimately, all of the CPU ends up in _wakeReadyWaiters_inlock. That particular implementation seems rather naive; it ends up recomputing a bunch of things (again, under a global, exclusive lock) for every replication waiter. Instead, it seems like the code should be structured so that it determines the largest optime that satisfies the various write concern modes (basically "majority" and w="N") once, and then passes that information down into _doneWaitingForReplication_inlock.

Beyond this, reading through the code, it's fairly concerning how coarse-grained _mutex on ReplicationCoordinatorImpl is.  Is there a reason more work hasn't been invested in finer-grained locks, or even reader-writer locks?  As-is, it's really difficult to make any perf improvements.
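
For context, the write-majority, insert-heavy pattern described above can be approximated from the mongo shell along these lines (an illustrative sketch only; collection name and document shape are arbitrary):

    // Batched inserts, each acknowledged by a majority of the replica set
    var docs = [];
    for (var i = 0; i < 1000; i++) {
        docs.push({k: i});
    }
    db.c.insertMany(docs, {writeConcern: {w: "majority"}});

Per the profile described in this comment, pushing enough of these majority-acknowledged batches concentrates CPU in setMyLastAppliedOpTimeForward and _wakeReadyWaiters_inlock on the primary.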

Comment by VictorGP [ 08/Jun/18 ]

Yes, this is our case; I guess upserts are also affected. We have a heavy write pattern, so bulk inserts are needed.

Isn't this use of bulk inserts a common operation among MongoDB users? I find it hard to believe that most users will use simple inserts when they have the option of batching them.
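
(For illustration, upserts can also be batched; a minimal shell sketch using bulkWrite, with made-up filters and updates:)

    db.c.bulkWrite([
        {updateOne: {filter: {_id: 1}, update: {$set: {state: "a"}}, upsert: true}},
        {updateOne: {filter: {_id: 2}, update: {$set: {state: "b"}}, upsert: true}}
    ], {ordered: false});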

 

Comment by Andy Schwerin [ 08/Jun/18 ]

victorgp and stutiredboy@gmail.com, this ticket tracks a very particular performance regression when clients are using the Bulk.insert or related driver functionality to efficiently insert a large volume of documents. Is this the action you're performing when you experience your performance problem? If not, I would suggest filing another ticket.

As for this particular performance regression, we hope to address it this year.
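
As a concrete shell-level illustration of the Bulk.insert usage referred to above (a minimal sketch; namespace and documents are arbitrary):

    var bulk = db.c.initializeUnorderedBulkOp();
    for (var i = 0; i < 10000; i++) {
        bulk.insert({k: i});
    }
    bulk.execute();   // the shell sends the queued inserts to the server as batched insert commands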

Comment by Adun [ 07/Jun/18 ]

Seconded.

The declining performance is a big problem for us.

We have already decided to give up on the 3.6.x line and stay on 3.4.

Comment by VictorGP [ 06/Jun/18 ]

Can we get an update on how this is going and when it is going to be released? Such a performance impact in a new MongoDB version should be a big concern. This was reported in October 2017 and has Major priority, but after ~8 months there is still no fix, which is a bit concerning.

In our case, we can't upgrade our infrastructure to 3.6; the performance degradation is too large for the database to run well.

Also, the last comment is from 1st Feb; I hope this has not been forgotten...

Comment by Andy Schwerin [ 01/Feb/18 ]

Moving to "Needs Triage" so replication team can assess when/how to schedule this. Given that it was in 3.7 Required, I expect it'll be going back to 3.7 Required as part of the re-triage.

Comment by Susan LoVerso [ 01/Feb/18 ]

I have made changes in WiredTiger develop to alleviate blocking on the timestamp queue locks. In WiredTiger-only timestamp tests I see a 12% improvement. When I run that against this workload I see no improvement. Pmp still shows a lot of contention, and I frequently see the thread holding the lock in the now() call, presumably gettimeofday. I don't believe WiredTiger is the underlying cause; reducing the lock contention in repl::getNextOpTimes and ReplicationCoordinatorImpl::setMyLastAppliedOpTimeForward seems to be the next step.

Comment by Susan LoVerso [ 18/Jan/18 ]

Running the MongoDB 3.6 branch against the WiredTiger 3.8 branch shows a small but very consistent 3% degradation in the insert workload (about 96.7k) compared to using the WiredTiger 3.6 branch. I am going to do some bisecting to see whether a single change accounts for it.

ETA: The change that causes the degradation was the addition of lock tracking statistics around the three rwlocks in WiredTiger in the transaction subsystem: the global rwlock, and the two rwlocks around the two timestamp queues. Those changes were added in WT-3725. I have opened WT-3857 to consider whether those should be removed or modified in some way.

If I turn off tracking on those three locks, then the WiredTiger 3.8 branch is on par with the WiredTiger 3.6 branch when run with the MongoDB 3.6 branch.

Comment by Susan LoVerso [ 17/Jan/18 ]

Since this is back on my plate I did baseline runs to see what the current situation is. Given above discussion, I ran on 3.4 with majority read concern turned on. Here are my results (avg of 3 runs):

Branch | Changeset | Insert ops/sec
3.4.9 RC | ce08a408 | 113.3k
3.6 | 85cf50d91d | 99.4k

So the 3.6 performance has not changed since my last runs back in November. I ran pmp on several runs to see if any patterns emerge from the stacks. I continue to see lots of contention in setMyLastAppliedOpTimeForward and getNextOpTimes. In particular, in every run I see the thread holding the lock from setMyLastAppliedOpTimeForward in _replExecutor->now(). Here's a good illustration: the test uses 15 insert threads, and here are 14 of them:

     13 __lll_lock_wait,_L_lock_791,pthread_mutex_lock,__gthread_mutex_lock,lock,lock,unique_lock,mongo::repl::ReplicationCoordinatorImpl::setMyLastAppliedOpTimeForward,operator(),mongo::RecoveryUnit::OnCommitChange::commit,mongo::repl::_logOpsInner,mongo::WiredTigerRecoveryUnit::_commit,commit,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,operator()
      1 mongo::executor::ThreadPoolTaskExecutor::now,mongo::repl::ReplicationCoordinatorImpl::_setMyLastAppliedOpTime_inlock,mongo::repl::ReplicationCoordinatorImpl::setMyLastAppliedOpTimeForward,operator(),mongo::RecoveryUnit::OnCommitChange::commit,mongo::repl::_logOpsInner,mongo::WiredTigerRecoveryUnit::_commit,commit,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,operator()

In another run, I also see some contention on the WiredTiger commit timestamp queue write lock. Eight threads are waiting on the mutex in getNextOpTimes, and the thread holding it has the commit timestamp queue lock while other threads in WiredTiger are blocked on that lock:

      8 __lll_lock_wait,_L_lock_791,pthread_mutex_lock,__gthread_mutex_lock,lock,lock_guard,mongo::repl::(anonymous,mongo::repl::getNextOpTimes,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,operator()
      1 __wt_writeunlock,__wt_txn_set_commit_timestamp,__wt_txn_set_timestamp,__session_timestamp_transaction,mongo::WiredTigerRecoveryUnit::setTimestamp,mongo::WiredTigerRecordStore::oplogDiskLocRegister,mongo::repl::(anonymous,mongo::repl::getNextOpTimes,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,operator()
      1 sched_yield,__wt_yield,__wt_writelock,__wt_txn_clear_commit_timestamp,__wt_txn_release,__wt_txn_commit,__session_commit_transaction,mongo::WiredTigerRecoveryUnit::_txnClose,mongo::WiredTigerRecoveryUnit::_commit,commit,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,operator()
      1 pthread_cond_timedwait@@GLIBC_2.3.2,__wt_cond_wait_signal,__wt_cond_wait,__wt_writelock,__wt_txn_clear_commit_timestamp,__wt_txn_release,__wt_txn_commit,__session_commit_transaction,mongo::WiredTigerRecoveryUnit::_txnClose,mongo::WiredTigerRecoveryUnit::_commit,commit,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,operator()

Comment by Bruce Lucas (Inactive) [ 11/Dec/17 ]

Situation on a 2-node repl set is similar:

3.4.10 default settings             "totalOps/s" : 162961.08065147756,
3.4.10 --enableMajorityReadConcern  "totalOps/s" : 149420.53540043414,
3.6.0  default settings             "totalOps/s" : 131686.83374001557,

Comment by Bruce Lucas (Inactive) [ 11/Dec/17 ]

In the test on this ticket, enabling majority read concern in 3.4.10 does decrease performance somewhat, by about 5%, but not by the 17% decrease seen at the default settings in 3.6.0.

3.4.10 --enableMajorityReadConcern  "totalOps/s" : 174735.2184719974,

Comment by Geert Bosch [ 11/Dec/17 ]

The picture changes a bit when you compare with 3.4 with majority reads enabled. IIRC, we're about at parity there. Of course, we have now enabled something by default that users may care about less than a 17% perf difference.

Comment by Bruce Lucas (Inactive) [ 11/Dec/17 ]

Status update: 3.6.0 is much improved compared to the test on 3.6.0-rc0 in the original description, but still shows ~17% performance regression compared to 3.4.10:

3.4.10:    "totalOps/s" : 183103.84044700526,
3.6.0-rc0  "totalOps/s" : 91202.31599375796,
3.6.0:     "totalOps/s" : 151687.90105520733,

Comment by Michael Cahill (Inactive) [ 28/Nov/17 ]

Reopening – since this workload isn't automated (yet), we need to keep this ticket open to make sure we drive down the remaining regression.

Comment by Eric Milkie [ 28/Nov/17 ]

I'm closing this ticket now, as we have no remaining immediate server changes planned to improve this general performance issue. We will continue working on the individual linked BF tickets, and such work may spawn new SERVER tickets if the work is to be done in the server.

Comment by Susan LoVerso [ 27/Nov/17 ]

I have run with the current mongodb master as of this morning (61122064) and WT develop (00374e0c7) and I'm seeing a solid 10%+ improvement from last week. These results are from a single-node replica set:

Branch | insert | TTL test
3.4.9 | 126k | 255k
Nov20 | 89k | 125k
Nov27 | 99k | 140k

Running pmp (twice) on the current code during the insert test now shows consistently about 5 insert threads waiting on the mutex in getNextOpTimes, 1 or 2 other insert threads waiting on the ReplicationCoordinator mutex and all the other threads doing real work. That is a big improvement!

Comment by Eric Milkie [ 21/Nov/17 ]

I've been working on making the calls to _setStableTimestampForStorage_inlock() more efficient, on the primary write path, but am seeing mixed results. In some cases it performs better and in some cases, especially when there is cache pressure, the performance is much worse. I'm continuing to investigate.

Comment by Susan LoVerso [ 21/Nov/17 ]

That is correct.

Comment by Michael Cahill (Inactive) [ 21/Nov/17 ]

sue.loverso, just to make sure we're on the same page: MongoDB 3.6 is not setting WiredTiger's stable_timestamp at all – all checkpoints write with a current transaction snapshot and no read timestamp.

Comment by Susan LoVerso [ 20/Nov/17 ]

If I understand correctly, the stable timestamp is not meaningful when we're primary,

Internally to WiredTiger the stable timestamp is meaningful on every machine as it is a visibility mechanism for checkpoints and eviction. So it must be updated frequently enough on a primary as well.

Comment by Geert Bosch [ 17/Nov/17 ]

There are various slowdowns, but the one that stands out is ReplicationCoordinatorImpl::_setStableTimestampForStorage_inlock, as it is quite expensive (about 20% as expensive as _insertRecords, which inserts and indexes the actual data), but executes inside of the ReplicationCoordinator mutex.

If I understand correctly, the stable timestamp is not meaningful when we're primary, as any majority confirmed timestamp is stable. When we're secondary, we only try to advance the stable timestamp once per batch, so it is not performance critical.

Comment by Susan LoVerso [ 13/Nov/17 ]

I reran with the latest changes in mongodb master from several tickets and the latest drop of WiredTiger. Running changeset 9fbb13033 and using the internal wiredtiger code I see the same performance gains that we saw with the hacked changes, about 91K ops/sec.

Running pmp shows 13 of 15 workload threads in one stack:

     13 __lll_lock_wait,_L_lock_791,pthread_mutex_lock,__gthread_mutex_lock,lock,lock,unique_lock,mongo::repl::ReplicationCoordinatorImpl::setMyLastAppliedOpTimeForward,operator(),mongo::RecoveryUnit::OnCommitChange::commit,mongo::repl::_logOpsInner,mongo::WiredTigerRecoveryUnit::_commit,commit,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,operator()

Comment by Eric Milkie [ 10/Nov/17 ]

I linked two more tickets to help improve performance. After these tickets go in, we should reevaluate the throughput and then decide if we need to do more. With the improvements made thus far, I don't think this needs to block the next rc.

Comment by Michael Cahill (Inactive) [ 10/Nov/17 ]

milkie, is there anything to do in this ticket? Now that we have SERVER-31899, can we close this as a dup?

Comment by Eric Milkie [ 09/Nov/17 ]

I filed SERVER-31899 to try to get back more performance – it covers the mutex on which calls to getTerm() and setMyLastAppliedOpTimeForward() were hitting contention.

Comment by Eric Milkie [ 09/Nov/17 ]

I have now pushed two commits into the master branch that should have a positive effect on performance of this. Continuing to look at perf numbers for the next bottleneck to address.

Comment by Susan LoVerso [ 08/Nov/17 ]

I have attached the full pmp output for the patched run with contention on the SessionCache mutex. I do not see an obvious thread that holds it. Note that no threads are active in WiredTiger.

Comment by Susan LoVerso [ 08/Nov/17 ]

The second run shows very different stacks. I see a lot of contention on the SessionCache mutex, including threads in the ensureSession code from Michael's patch:

      7 __lll_lock_wait,_L_lock_791,pthread_mutex_lock,__gthread_mutex_lock,lock,lock_guard,mongo::WiredTigerSessionCache::releaseSession,mongo::WiredTigerRecoveryUnit::~WiredTigerRecoveryUnit,operator(),~unique_ptr,~OperationContext,~OperationContext,mongo::ServiceContext::OperationContextDeleter::operator(),reset,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,operator()
      3 __lll_lock_wait,_L_lock_791,pthread_mutex_lock,__gthread_mutex_lock,lock,lock_guard,mongo::WiredTigerSessionCache::getSession,mongo::WiredTigerRecoveryUnit::_ensureSession,_ensureSession,mongo::WiredTigerRecoveryUnit::_txnOpen,mongo::repl::(anonymous,mongo::repl::getNextOpTimes,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,operator()
      1 pthread_mutex_lock,__gthread_mutex_lock,lock,lock_guard,mongo::WiredTigerSessionCache::getSession,mongo::WiredTigerRecoveryUnit::_ensureSession,_ensureSession,mongo::WiredTigerRecoveryUnit::_txnOpen,mongo::repl::(anonymous,mongo::repl::getNextOpTimes,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,operator()
      1 __lll_lock_wait,_L_lock_791,pthread_mutex_lock,__gthread_mutex_lock,lock,lock_guard,mongo::WiredTigerSessionCache::releaseSession,operator(),~unique_ptr,mongo::WiredTigerSessionCache::waitUntilDurable

There are also two threads waiting on the repl coordinator mutex in setMyLastAppliedOpTimeForward and one waiting on the newOpMutex in getNextOpTimes.

Comment by Susan LoVerso [ 08/Nov/17 ]

This is very interesting. I ran pmp on two separate runs of the test and saw very different output. On one run, I still see more than half the workload threads waiting on the repl coordinator mutex in setMyLastAppliedOpTimeForward, even with the small change Michael added. Also, it would make sense for the same change to go into setMyLastDurableOpTimeForward:

      7 __lll_lock_wait,_L_lock_791,pthread_mutex_lock,__gthread_mutex_lock,lock,std::unique_lock<std::mutex>::lock,unique_lock,mongo::repl::ReplicationCoordinatorImpl::setMyLastAppliedOpTimeForward,operator(),mongo::RecoveryUnit::OnCommitChange::commit,mongo::repl::_logOpsInner,mongo::WiredTigerRecoveryUnit::_commit,commit,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,operator()
      1 __lll_unlock_wake,_L_unlock_697,pthread_mutex_unlock,__gthread_mutex_unlock,unlock,std::unique_lock<std::mutex>::unlock,~unique_lock,mongo::repl::ReplicationCoordinatorImpl::setMyLastAppliedOpTimeForward,operator(),mongo::RecoveryUnit::OnCommitChange::commit,mongo::repl::_logOpsInner,mongo::WiredTigerRecoveryUnit::_commit,commit,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,operator()
      1 __lll_lock_wait,_L_lock_791,pthread_mutex_lock,__gthread_mutex_lock,lock,lock,unique_lock,mongo::repl::ReplicationCoordinatorImpl::setMyLastDurableOpTimeForward,mongo::WiredTigerSessionCache::waitUntilDurable

The thread holding it is at the beginning of the worker function:

      1 mongo::executor::ThreadPoolTaskExecutor::now,mongo::repl::ReplicationCoordinatorImpl::_setMyLastAppliedOpTime_inlock,mongo::repl::ReplicationCoordinatorImpl::setMyLastAppliedOpTimeForward,operator(),mongo::RecoveryUnit::OnCommitChange::commit,mongo::repl::_logOpsInner,mongo::WiredTigerRecoveryUnit::_commit,commit,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,operator()

Comment by Susan LoVerso [ 08/Nov/17 ]

Interestingly, Michael and I both ran with his patch installed and saw different stacks. He frequently saw threads in synchronizeOnOplogInFlightResource, while my runs showed setMyLastAppliedOpTimeForward and getTerm. I am running on AWS; I don't know where Michael is running. These numbers are all with a production build of WiredTiger (my earlier numbers were with a diagnostic build):

Here are my results today:

MongoDB Build | WiredTiger Build | ops/sec
r3.5.8 | mongodb-3.5.8 | 113k
master (11e68f0) | develop (d51a7e37) | 69k
master + all diffs above | WT-3715 branch | 90k

With those changes we've regained a large portion of the performance, but we're still below 3.5.8. I'll keep looking into the stacks of the patched runs and see if I can find more bottlenecks.

Comment by Eric Milkie [ 08/Nov/17 ]

Nice find. I agree that’s a good change to make. Tomorrow morning I’ll take Michael’s patch and prepare it for commit.

Comment by Susan LoVerso [ 07/Nov/17 ]

While I focused yesterday on WiredTiger locking statistics I think the biggest problem is that nearly all threads are waiting on a single lock here:

     11 __lll_lock_wait,_L_lock_791,pthread_mutex_lock,__gthread_mutex_lock,lock,lock_guard,mongo::repl::(anonymous,mongo::repl::getNextOpTimes,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,operator()
      1 sched_yield,__wt_yield,__wt_writeunlock,__wt_txn_set_commit_timestamp,__wt_txn_set_timestamp,__session_timestamp_transaction,mongo::WiredTigerRecoveryUnit::setTimestamp,mongo::WiredTigerRecordStore::oplogDiskLocRegister,mongo::repl::(anonymous,mongo::repl::getNextOpTimes,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,operator()

Mongod runs 50% faster in 3.5.12 than it does in 3.5.13 (and later, up to today's master). 3.5.12 gets about 67k ops/sec and 3.5.13+ gets 45k ops/sec. Current master also gets 45k ops/sec.

milkie, the stacks above show where the bottleneck is. I see threads waiting on the mutex while one thread is setting its first commit timestamp on a transaction. michael.cahill has a diff that pulls the starting of a transaction outside the mutex in getNextOpTimes. That has a large positive effect on performance. I have attached it as ensure-session.patch. It needs review, naming, and cleanup, but it shows the idea. michael.cahill's numbers show this patch gets almost a 50% speedup (master 64k ops/sec, patched 93k ops/sec).

Comment by Michael Cahill (Inactive) [ 06/Nov/17 ]

sue.loverso, thanks for the update. The changes you are suggesting to __wt_txn_global_set_timestamp are in the WT-3715 branch.

Comment by Susan LoVerso [ 06/Nov/17 ]

Last week I ran this test against successive releases from 3.5.9 through 3.5.13. Each showed a gradual decline, but then there was a large drop between 3.5.12 and 3.5.13. More code that uses timestamps was also added gradually over that period.

Release | insert ops/sec
3.5.9 | 80k
3.5.11 | 76k
3.5.12 | 66k
3.5.13 | 46k

I instrumented a whole bunch of temporary statistics (added to 3.5.13) to characterize this today. Here are some results. This builds on the write-lock observations Bruce mentioned above, but with a lot more detail.

  • Every transaction takes the txn_global->commit_timestamp_rwlock for writing twice (once to insert into the queue, once to remove).
  • Acquiring that rwlock for writing fails 1.3% of the time (consistently).
  • Nearly every commit takes the txn_global->rwlock for writing.
  • Acquiring the global rwlock in commit is 66% of the total acquisitions.
  • Every call to conn->set_timestamp takes the txn_global->rwlock for writing.
  • Acquiring the global rwlock for writing in wt_txn_global_set_timestamp fails 2% of the time (consistently).
  • Only 1.7% of calls to wt_txn_global_set_timestamp result in actually updating any of the global timestamps.
  • Acquiring the global rwlock in wt_txn_global_set_timestamp is the remaining 34% of the total lock acquisitions.

The lowest-hanging fruit here seems to be wt_txn_global_set_timestamp. It is called from replication code, and it always, unconditionally acquires the txn_global->rwlock for writing even though it does not need to update anything 98.3% of the time. If we could avoid taking that lock until we actually need to update, that should help, particularly if the replication stack is holding a lock in its own code while setting the timestamp.

Comment by Susan LoVerso [ 03/Nov/17 ]

Thanks for the explanations! While SERVER-30845, which milkie mentioned, is one path, I'm sure the locks inside WiredTiger are the long pole here. As I test each release I see gradual degradation (with the bigger leap at 3.5.13), so it is likely that the gradual addition of more code using timestamps has increased the contention with each 3.5.N release. I have new ideas for experiments I want to run with this.

Comment by William Schultz (Inactive) [ 02/Nov/17 ]

As Andy mentioned, my commit c7661b14867cd058e1a67986b8e05a7020fc0a5e did introduce a considerable regression, which was noted by BF-6399. It was resolved, however, by commit bf8ae1b3edbf7470e92b7407f76a042cc2246d48. Even with that fix, though, it could still be investigated as a possible source of the regression against 3.4. Perhaps we should determine if the slowest part of the "set stable timestamp" logic lies at the replication layer (i.e. ReplicationCoordinatorImpl::_setStableTimestampForStorage_inlock), or at the storage layer (StorageInterface::setStableTimestamp)?

Comment by Eric Milkie [ 02/Nov/17 ]

There is also SERVER-30845.

Comment by Andy Schwerin [ 02/Nov/17 ]

In those stacks, there is a thread blocked on a WT spinlock trying to set the "oldest timestamp" / "stable timestamp for storage" while holding the replication coordinator mutex. If that's actually taking a long time, and not a coincidence, then it is causing other inserts to block waiting to do the same (7 threads) or to assign a term to their oplog entries (2 threads).

That makes c7661b14867cd058e1a67986b8e05a7020fc0a5e by william.schultz a candidate. It was committed on August 8, so the first release it appeared in was 3.5.12.

sue.loverso, is it hard to run your test at and immediately before this commit? If not, could you?

If this is the problem, then our goal is probably to speed up _storage->setStableTimestamp, to shorten the critical section. redbeard0531, did you look into this earlier when optimizing?

Does this affect benchmarks other than bulk insert?

Comment by Susan LoVerso [ 02/Nov/17 ]

I have attached the full pmp output files, showing all threads, that contain the stacks above for both 3.5.9 and 3.5.13.

Comment by Susan LoVerso [ 02/Nov/17 ]

Ok, since 3.5.13 the replication initiation code changed such that running just rs.initiate() was failing, and then running the script produced the 14k ops/sec result in some hobbled mode. With replication correctly started I'm seeing about 46k ops/sec. Much better, but far less than the 80k I see consistently in 3.5.9. Running pmp.sh during the test, I see most/all threads in the WiredTiger library inserting items in 3.5.9, and most insert threads waiting for a replication lock in 3.5.13. If I grep pmp.out for receivedInsert to see what the 15 application threads are doing, in 3.5.9 I see:

     14 select,__wt_sleep,__wt_page_in_func,__wt_page_swap_func,__wt_row_search,__cursor_row_search,__wt_btcur_insert,__curfile_insert,mongo::WiredTigerRecordStore::_insertRecords,mongo::WiredTigerRecordStore::insertRecords,mongo::CollectionImpl::_insertDocuments,mongo::CollectionImpl::insertDocuments,insertDocuments,mongo::insertDocuments,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::processMessage,mongo::ServiceStateMachine::runNext,operator(),std::_Function_handler<void(),,operator(),mongo::(anonymous,start_thread,clone
      1 __split_insert,__wt_split_insert,__evict_review,__wt_evict,__wt_page_release_evict,__wt_page_in_func,__wt_page_swap_func,__wt_row_search,__cursor_row_search,__wt_btcur_insert,__curfile_insert,mongo::WiredTigerRecordStore::_insertRecords,mongo::WiredTigerRecordStore::insertRecords,mongo::CollectionImpl::_insertDocuments,mongo::CollectionImpl::insertDocuments,insertDocuments,mongo::insertDocuments,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::processMessage,mongo::ServiceStateMachine::runNext,operator(),std::_Function_handler<void(),,operator(),mongo::(anonymous,start_thread,clone

But in 3.5.13, 11 out of the 15 threads are waiting on one or more locks in MongoDB and only a single thread is inside WiredTiger:

      7 __lll_lock_wait,_L_lock_791,pthread_mutex_lock,__gthread_mutex_lock,lock,lock,unique_lock,mongo::repl::ReplicationCoordinatorImpl::setMyLastAppliedOpTimeForward,operator(),mongo::RecoveryUnit::OnCommitChange::commit,mongo::repl::_logOpsInner,mongo::WiredTigerRecoveryUnit::_commit,commit,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,mongo::ServiceStateMachine::runNext,operator(),mongo::transport::ServiceExecutorSynchronous::schedule(std::function<void,_scheduleFunc<mongo::ServiceStateMachine::scheduleNext(mongo::transport::ServiceExecutor::ScheduleFlags)::<lambda()>,scheduleNext,mongo::ServiceStateMachine::_sourceCallback,mongo::ServiceStateMachine::_runNextInGuard,mongo::ServiceStateMachine::runNext,operator(),operator(),std::_Function_handler<void(),,operator(),mongo::(anonymous,start_thread,clone
      2 __lll_lock_wait,_L_lock_791,pthread_mutex_lock,__gthread_mutex_lock,lock,lock_guard,mongo::repl::ReplicationCoordinatorImpl::getTerm,mongo::repl::(anonymous,mongo::repl::getNextOpTimes,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,mongo::ServiceStateMachine::runNext,operator(),mongo::transport::ServiceExecutorSynchronous::schedule(std::function<void,_scheduleFunc<mongo::ServiceStateMachine::scheduleNext(mongo::transport::ServiceExecutor::ScheduleFlags)::<lambda()>,scheduleNext,mongo::ServiceStateMachine::_sourceCallback,mongo::ServiceStateMachine::_runNextInGuard,mongo::ServiceStateMachine::runNext,operator(),operator(),std::_Function_handler<void(),,operator(),mongo::(anonymous,start_thread,clone
      1 tcmalloc::ThreadCache::FetchFromCentralCache,Allocate,do_malloc_small,do_malloc,cpp_alloc,tc_new,mongo::WiredTigerRecordStore::OplogStones::updateCurrentStoneAfterInsertOnCommit,mongo::WiredTigerRecordStore::_insertRecords,mongo::WiredTigerRecordStore::insertRecordsWithDocWriter,mongo::CollectionImpl::insertDocumentsForOplog,insertDocumentsForOplog,mongo::repl::_logOpsInner,mongo::repl::logInsertOps,mongo::OpObserverImpl::onInserts,mongo::CollectionImpl::insertDocuments,insertDocuments,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,mongo::ServiceStateMachine::runNext,operator(),mongo::transport::ServiceExecutorSynchronous::schedule(std::function<void,_scheduleFunc<mongo::ServiceStateMachine::scheduleNext(mongo::transport::ServiceExecutor::ScheduleFlags)::<lambda()>,scheduleNext,mongo::ServiceStateMachine::_sourceCallback,mongo::ServiceStateMachine::_runNextInGuard,mongo::ServiceStateMachine::runNext,operator(),operator(),std::_Function_handler<void(),,operator(),mongo::(anonymous,start_thread,clone
      1 sched_yield,__wt_yield,__wt_readlock,__wt_txn_global_set_timestamp,__conn_set_timestamp,mongo::WiredTigerKVEngine::_setOldestTimestamp,mongo::repl::ReplicationCoordinatorImpl::_setStableTimestampForStorage_inlock,mongo::repl::ReplicationCoordinatorImpl::_setMyLastAppliedOpTime_inlock,mongo::repl::ReplicationCoordinatorImpl::setMyLastAppliedOpTimeForward,operator(),mongo::RecoveryUnit::OnCommitChange::commit,mongo::repl::_logOpsInner,mongo::WiredTigerRecoveryUnit::_commit,commit,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,mongo::ServiceStateMachine::runNext,operator(),mongo::transport::ServiceExecutorSynchronous::schedule(std::function<void,_scheduleFunc<mongo::ServiceStateMachine::scheduleNext(mongo::transport::ServiceExecutor::ScheduleFlags)::<lambda()>,scheduleNext,mongo::ServiceStateMachine::_sourceCallback,mongo::ServiceStateMachine::_runNextInGuard,mongo::ServiceStateMachine::runNext,operator(),operator(),std::_Function_handler<void(),,operator(),mongo::(anonymous,start_thread,clone
      1 operator==,mongo::WiredTigerIndex::setKey,mongo::WiredTigerIndexUnique::_insert,mongo::WiredTigerIndex::insert,mongo::IndexAccessMethod::insert,mongo::IndexCatalogImpl::_indexFilteredRecords,mongo::IndexCatalogImpl::_indexRecords,mongo::IndexCatalogImpl::indexRecords,indexRecords,mongo::CollectionImpl::_insertDocuments,mongo::CollectionImpl::insertDocuments,insertDocuments,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,mongo::ServiceStateMachine::runNext,operator(),mongo::transport::ServiceExecutorSynchronous::schedule(std::function<void,_scheduleFunc<mongo::ServiceStateMachine::scheduleNext(mongo::transport::ServiceExecutor::ScheduleFlags)::<lambda()>,scheduleNext,mongo::ServiceStateMachine::_sourceCallback,mongo::ServiceStateMachine::_runNextInGuard,mongo::ServiceStateMachine::runNext,operator(),operator(),std::_Function_handler<void(),,operator(),mongo::(anonymous,start_thread,clone
      1 lock,lock_guard,mongo::LockerImpl<false>::getLockMode,mongo::LockerImpl<false>::isLocked,mongo::repl::ReplicationCoordinatorImpl::canAcceptWritesFor,mongo::repl::ReplicationCoordinatorImpl::shouldRelaxIndexConstraints,mongo::IndexCatalogImpl::prepareInsertDeleteOptions,mongo::IndexCatalogImpl::_indexFilteredRecords,mongo::IndexCatalogImpl::_indexRecords,mongo::IndexCatalogImpl::indexRecords,indexRecords,mongo::CollectionImpl::_insertDocuments,mongo::CollectionImpl::insertDocuments,insertDocuments,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,mongo::ServiceStateMachine::runNext,operator(),mongo::transport::ServiceExecutorSynchronous::schedule(std::function<void,_scheduleFunc<mongo::ServiceStateMachine::scheduleNext(mongo::transport::ServiceExecutor::ScheduleFlags)::<lambda()>,scheduleNext,mongo::ServiceStateMachine::_sourceCallback,mongo::ServiceStateMachine::_runNextInGuard,mongo::ServiceStateMachine::runNext,operator(),operator(),std::_Function_handler<void(),,operator(),mongo::(anonymous,start_thread,clone
      1 __lll_lock_wait,_L_lock_791,pthread_mutex_lock,__gthread_mutex_lock,lock,lock_guard,mongo::repl::(anonymous,mongo::repl::getNextOpTimes,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,mongo::ServiceStateMachine::runNext,operator(),mongo::transport::ServiceExecutorSynchronous::schedule(std::function<void,_scheduleFunc<mongo::ServiceStateMachine::scheduleNext(mongo::transport::ServiceExecutor::ScheduleFlags)::<lambda()>,scheduleNext,mongo::ServiceStateMachine::_sourceCallback,mongo::ServiceStateMachine::_runNextInGuard,mongo::ServiceStateMachine::runNext,operator(),operator(),std::_Function_handler<void(),,operator(),mongo::(anonymous,start_thread,clone
      1 GetOrDefault,GetSizeClassIfCached,do_free_helper,do_free_with_callback,do_free,tc_free,~intrusive_ptr,~SharedBuffer,~ConstSharedBuffer,~BSONObj,~OplogDocWriter,_Destroy<mongo::repl::(anonymous,__destroy<mongo::repl::(anonymous,_Destroy<mongo::repl::(anonymous,_Destroy<mongo::repl::(anonymous,~vector,mongo::repl::logInsertOps,mongo::OpObserverImpl::onInserts,mongo::CollectionImpl::insertDocuments,insertDocuments,mongo::(anonymous,operator(),writeConflictRetry<mongo::(anonymous,insertBatchAndHandleErrors,mongo::performInserts,receivedInsert,mongo::ServiceEntryPointMongod::handleRequest,mongo::ServiceStateMachine::_processMessage,mongo::ServiceStateMachine::_runNextInGuard,mongo::ServiceStateMachine::runNext,operator(),mongo::transport::ServiceExecutorSynchronous::schedule(std::function<void,_scheduleFunc<mongo::ServiceStateMachine::scheduleNext(mongo::transport::ServiceExecutor::ScheduleFlags)::<lambda()>,scheduleNext,mongo::ServiceStateMachine::_sourceCallback,mongo::ServiceStateMachine::_runNextInGuard,mongo::ServiceStateMachine::runNext,operator(),operator(),std::_Function_handler<void(),,operator(),mongo::(anonymous,start_thread,clone

spencer can you take a look and re-assign if needed?

Comment by Susan LoVerso [ 02/Nov/17 ]

could you please retest against the WT-3715 branch to see whether it improves things?

Running on that branch with MongoDB master yields the same throughput, about 14100 ops/sec. So there is no change compared to the earlier run I did last week. Continuing to try to isolate when in the timeline the large drop happened and then get stacks and isolate further.

Comment by Susan LoVerso [ 02/Nov/17 ]

Yes, I've been running tests today trying to isolate when it started dropping so much, but I'll retest with the branch first.

Comment by Michael Cahill (Inactive) [ 02/Nov/17 ]

sue.loverso, could you please retest against the WT-3715 branch to see whether it improves things? There are some changes in that branch that avoid acquiring the timestamp rwlock when we're not going to update anything.

Also, the lookaside eviction changes probably explain most of the additional regression when replication is enabled, so that should also be better on the branch.

Thanks!

Comment by Susan LoVerso [ 26/Oct/17 ]

Here are the results from the runs I did today. I ran 3.5.8 with the current WiredTiger develop branch, but it showed the same numbers as straight 3.5.8. I will further investigate the timestamp and replication issue above, and given my much worse numbers, I'm sure it'll be easy to get stacks. I will run with successive releases because the timestamp work was rolled out gradually. The numbers below are the average of 3 runs, and they did not vary much run to run.

Mode | 3.4.9 | 3.5.8 | MongoDB master + WT develop
Standalone | 205497 | 130524 (-37%) | 115737 (-44%)
1-node repl | 126561 | 88990 (-30%) | 14131 (-89%)
Standalone/repl diff | -39% | -32% | -88%

Comment by Susan LoVerso [ 26/Oct/17 ]

bruce.lucas thank you for those stacks. I ran eight variations today and I'll post the full results soon. But your stacks give a strong clue for where we should start. FYI I ran MongoDB-3.5.8+current WiredTiger-develop and MongoDB-master+WiredTiger-develop. Master is running 84% slower than 3.5.8 (14100 ops/sec versus 88000 ops/sec).

Comment by Bruce Lucas (Inactive) [ 26/Oct/17 ]

I looked at the ftdc data from the test and confirmed that in 3.6.0-rc0 there are a large number of write-lock calls for the replica set case but not for the standalone case, which makes them a possible suspect for the regression seen in the former but not the latter.

Using gdb, I determined that we're calling __wt_writelock in the following 5 places in equal numbers, apparently each once per update:

__wt_writelock
__wt_txn_clear_commit_timestamp
__wt_txn_release
__wt_txn_commit
__session_commit_transaction
mongo::WiredTigerRecoveryUnit::_txnClose
mongo::WiredTigerRecoveryUnit::_commit
mongo::UpdateStage::transformAndUpdate
 
__wt_writelock
__wt_txn_commit
__session_commit_transaction
mongo::WiredTigerRecoveryUnit::_txnClose
mongo::WiredTigerRecoveryUnit::_commit
mongo::UpdateStage::transformAndUpdate
 
__wt_writelock
__wt_txn_global_set_timestamp
__conn_set_timestamp
mongo::WiredTigerKVEngine::_setOldestTimestamp
mongo::repl::ReplicationCoordinatorImpl::_setStableTimestampForStorage_inlock
mongo::repl::ReplicationCoordinatorImpl::_setMyLastAppliedOpTime_inlock
mongo::repl::ReplicationCoordinatorImpl::setMyLastAppliedOpTimeForward
void mongo::RecoveryUnit::onCommit<...>
mongo::WiredTigerRecoveryUnit::_commit
mongo::UpdateStage::transformAndUpdate
 
__wt_writelock
__wt_txn_set_commit_timestamp
__wt_txn_set_timestamp
__session_timestamp_transaction
mongo::WiredTigerRecoveryUnit::setTimestamp
mongo::WiredTigerRecordStore::oplogDiskLocRegister
mongo::repl::(anonymous namespace)::_getNextOpTimes
mongo::repl::logOp
mongo::OpObserverImpl::onUpdate
mongo::CollectionImpl::updateDocumentWithDamages
mongo::UpdateStage::transformAndUpdate
 
__wt_writelock
__wt_txn_update_pinned_timestamp
__wt_txn_global_set_timestamp
__conn_set_timestamp
mongo::WiredTigerKVEngine::_setOldestTimestamp
mongo::repl::ReplicationCoordinatorImpl::_setStableTimestampForStorage_inlock
mongo::repl::ReplicationCoordinatorImpl::_setMyLastAppliedOpTime_inlock
mongo::repl::ReplicationCoordinatorImpl::setMyLastAppliedOpTimeForward
void mongo::RecoveryUnit::onCommit<...>
mongo::WiredTigerRecoveryUnit::_commit
mongo::UpdateStage::transformAndUpdate

Comment by Bruce Lucas (Inactive) [ 24/Oct/17 ]

12 cores / 24 cpus
SSD disk (but probably not relevant since there are no checkpoints in the run and not specifying j:true)
cache far from full

Comment by Daniel Pasette (Inactive) [ 24/Oct/17 ]

What's your machine setup?

Comment by Bruce Lucas (Inactive) [ 24/Oct/17 ]

It can use either, depending on the value of the writeCmd option; ops/s differs depending on which you choose, but the regression is essentially the same.
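
For reference, my understanding is that the choice is made per op via benchRun's writeCmd flag; a hedged sketch of the write-command variant of the workload in the description:

    res = benchRun({
        ops: [
            {op: "insert", doc: {}, ns: "test.c", writeCmd: true}  // use the insert write command rather than the legacy wire-protocol insert
        ],
        seconds: 45,
        parallel: 15
    })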

Comment by Eric Milkie [ 24/Oct/17 ]

How does this workload compare to the insert Microbenchmark tests and the YCSB load sys-perf tests?
Also, does benchRun use old-style OP_QUERY inserts, or does it use write commands?
