[SERVER-18281] lock contention in mongo::DBConnectionPool::get Created: 30/Apr/15  Updated: 06/Oct/15  Resolved: 06/Oct/15

Status: Closed
Project: Core Server
Component/s: Internal Client, Sharding
Affects Version/s: 3.0.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Rui Zhang (Inactive) Assignee: Misha Tyulenev
Resolution: Done Votes: 0
Labels: 32qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding A (10/09/15)
Participants:

 Description   

Test

  • 3.0.1
  • profiling with linux perf on mongos
  • 7 shard of 3 member replica-set
  • load 20M doc, then run mixed traffic

Observation

During multiple shard testing, saw mongos mongo::DBConnectionPool::get usage is high in both CPU usage (mainly due to spin_lock) and sleep time (due to futex in Off-CPU capture)

The following numbers are for 100% read, saw similar behavior with insert and mixed traffic

Type Total Event/Trace Trace w/ mongo::DBConnectionPool::get %
on CPU 262959 81145 30.9%
off CPU 2373010 165658 7%

Note:

  • On CPU measures CPU usage by sampling on CPU stack. in this case, it is mainly caused by spin_lock
  • Off CPU is to show sleep due to f/mutex or IO. See https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times for more details.
  • Trace w/ mongo::DBConnectionPool::get is to count how many time this function show up anywhere in the stack track.
  • Majority of the sleep time of mongos is due to wait for network request/response, which shall be normal. Other sleep time,

On CPU usage with spin_lock

-  61.44%  mongos  [kernel.kallsyms]    [k] _raw_spin_lock
   - _raw_spin_lock
      - 59.40% futex_wake
           do_futex
           sys_futex
           system_call_fastpath
         - __lll_unlock_wake
            - 42.15% mongo::DBConnectionPool::get(std::string const&, double)
                 mongo::ShardConnection::_init()
                 mongo::ShardConnection::ShardConnection(mongo::Shard const&, std::string const&, boost::shared_ptr<mongo::ChunkManager const>)
                 mongo::ParallelSortClusteredCursor::setupVersionAndHandleSlaveOk(boost::shared_ptr<mongo::ParallelConnectionState>, mongo::Shard const&, boost::shared_ptr<mongo::Shard>, mongo::NamespaceString const&, std::string const&,
               + mongo::ParallelSortClusteredCursor::startInit()
            + 32.80% mongo::ShardConnection::releaseMyConnections()
            + 22.66% mongo::ClientConnections::done(std::string const&, mongo::DBClientBase*)
            + 0.71% mongo::DBClientReplicaSet::_getMonitor() const
            + 0.58% mongo::Shard::reset(std::string const&)
            + 0.57% mongo::StaticShardInfo::find(std::string const&)
      - 40.19% futex_wait_setup
           futex_wait
           do_futex
           sys_futex
           system_call_fastpath
         - __lll_lock_wait
            - 41.75% mongo::DBConnectionPool::get(std::string const&, double)
                 mongo::ShardConnection::_init()
                 mongo::ShardConnection::ShardConnection(mongo::Shard const&, std::string const&, boost::shared_ptr<mongo::ChunkManager const>)
                 mongo::ParallelSortClusteredCursor::setupVersionAndHandleSlaveOk(boost::shared_ptr<mongo::ParallelConnectionState>, mongo::Shard const&, boost::shared_ptr<mongo::Shard>, mongo::NamespaceString const&, std::string const&,
                 mongo::ParallelSortClusteredCursor::startInit()
                 mongo::ParallelSortClusteredCursor::fullInit()
                 mongo::Strategy::queryOp(mongo::Request&)
                 mongo::Request::process(int)
                 mongo::ShardedMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*, mongo::LastError*)
                 mongo::PortMessageServer::handleIncomingMsg(void*)
                 start_thread
            + 39.48% mongo::ShardConnection::releaseMyConnections()
            + 15.74% mongo::ClientConnections::done(std::string const&, mongo::DBClientBase*)
            + 1.14% mongo::ReplicaSetMonitor::get(std::string const&, bool)
            + 0.92% mongo::StaticShardInfo::find(std::string const&)
            + 0.71% mongo::Shard::reset(std::string const&)
-   8.23%  mongos  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 67.11% try_to_wake_up
           wake_up_state
           wake_futex
           futex_wake
           do_futex
           sys_futex
           system_call_fastpath
         - __lll_unlock_wake
            - 36.71% mongo::DBConnectionPool::get(std::string const&, double)
                 mongo::ShardConnection::_init()
                 mongo::ShardConnection::ShardConnection(mongo::Shard const&, std::string const&, boost::shared_ptr<mongo::ChunkManager const>)
                 mongo::ParallelSortClusteredCursor::setupVersionAndHandleSlaveOk(boost::shared_ptr<mongo::ParallelConnectionState>, mongo::Shard const&, boost::shared_ptr<mongo::Shard>, mongo::NamespaceString const&, std::string const&,
               + mongo::ParallelSortClusteredCursor::startInit()
            + 17.55% mongo::ShardConnection::releaseMyConnections()
            + 9.34% mongo::ClientConnections::done(std::string const&, mongo::DBClientBase*)
            + 8.95% mongo::Shard::reset(std::string const&)
            + 8.58% mongo::StaticShardInfo::find(std::string const&)
            + 7.52% mongo::DBClientReplicaSet::getServerAddress() const
            + 6.38% mongo::DBClientReplicaSet::_getMonitor() const
            + 2.60% mongo::checkShardVersion(mongo::DBClientBase*, std::string const&, boost::shared_ptr<mongo::ChunkManager const>, bool, int)
            + 1.82% mongo::DBConfig::getChunkManagerOrPrimary(std::string const&, boost::shared_ptr<mongo::ChunkManager const>&, boost::shared_ptr<mongo::Shard>&)
      + 28.80% __wake_up_sync_key
      + 3.06% mod_timer
      + 1.02% prepare_to_wait

off-cpu sleep due to lock

-  53.90%          mongos  [kernel.kallsyms]  [k] __schedule
   - __schedule
   - schedule
      + 85.45% schedule_timeout
      - 12.46% futex_wait_queue_me
           futex_wait
           do_futex
           sys_futex
         - system_call_fastpath
            - 93.04% __lll_lock_wait
               + 15.94% mongo::ClientConnections::done(std::string const&, mongo::DBClientBase*)
               + 15.15% mongo::checkShardVersion(mongo::DBClientBase*, std::string const&, boost::shared_ptr<mongo::ChunkManager const>, bool, int)
               + 14.79% mongo::ReplicaSetMonitor::get(std::string const&, bool)
               - 13.30% mongo::DBConnectionPool::get(std::string const&, double)
                  - 99.95% mongo::ShardConnection::_init()
                       mongo::ShardConnection::ShardConnection(mongo::Shard const&, std::string const&, boost::shared_ptr<mongo::ChunkManager const>)
                       mongo::ParallelSortClusteredCursor::setupVersionAndHandleSlaveOk(boost::shared_ptr<mongo::ParallelConnectionState>, mongo::Shard const&, boost::shared_ptr<mongo::Shard>, mongo::NamespaceString const&, std::string c
                       mongo::ParallelSortClusteredCursor::startInit()
                       mongo::ParallelSortClusteredCursor::fullInit()
                       mongo::Strategy::queryOp(mongo::Request&)
                       mongo::Request::process(int)
                       mongo::ShardedMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*, mongo::LastError*)
                       mongo::PortMessageServer::handleIncomingMsg(void*)
                       start_thread
               + 12.95% mongo::StaticShardInfo::find(std::string const&)
               + 7.44% mongo::ReplicaSetMonitor::getHostOrRefresh(mongo::ReadPreferenceSetting const&)
               + 6.64% mongo::ShardConnection::releaseMyConnections()
               + 6.09% mongo::ParallelSortClusteredCursor::startInit()
               + 3.31% mongo::Request::reset()
               + 1.95% mongo::ReplicaSetMonitor::getServerAddress() const
               + 1.80% mongo::Shard::reset(std::string const&)
               + 0.52% _nss_files_gethostbyname2_r
            + 3.88% pthread_cond_timedwait@@GLIBC_2.3.2
            + 3.03% __lll_lock_wait_private
      + 1.49% schedule_hrtimeout_range_clock
      + 0.60% do_nanosleep



 Comments   
Comment by Misha Tyulenev [ 06/Oct/15 ]

This is not an issue for 3.1 where mongos uses new code: https://jira.mongodb.org/browse/SERVER-15176

Comment by Misha Tyulenev [ 06/Oct/15 ]

mongos uses different code path starting from 3.1.9. In case the lock contention it will be different issue and different locks.

Comment by Ian Whalen (Inactive) [ 05/Oct/15 ]

misha.tyulenev - can you please retest this since you're already looking at perf related to the new commands?

Comment by Andy Schwerin [ 28/Sep/15 ]

Bumping to 3.1 Required to make sure Rui re-tests per comments in this ticket, after SERVER-15176.

Comment by Rui Zhang (Inactive) [ 21/Jul/15 ]

sure, will re-test when ready. add this to my backlog queue

Comment by Andy Schwerin [ 20/Jul/15 ]

rui.zhang, please retest against 3.1 after SERVER-15176 is resolved.

Generated at Thu Feb 08 03:47:10 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.