Core Server / SERVER-7291

MongoS can hang after replica set reconfiguration due to writeback listener command sequence

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • None
    • Affects Version/s: 2.0.7
    • Component/s: Sharding
    • Environment:
      CentOS 6.3, MongoDB v2.0.7 release. Note this may affect v2.2 as well.
    • ALL

      In mongos, ReplicaSetMonitor::_check() calls _checkConnection(), which acquires _checkConnectionLock and then calls _checkStatus(). _checkStatus() issues a blocking replSetGetStatus request to other nodes while the lock is still held.

      This causes all commands sent to mongos to hang, apparently due to a combination of WriteBackCommand::run() on mongod and WriteBackListener::run() on mongos.
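
The locking pattern described above can be illustrated with a minimal C++ sketch. This is not the MongoDB source: MonitorSketch, checkHoldingLock(), checkWithoutLock(), tryGetMaster() and readerBlockedDuringProbe() are hypothetical names, and the probe callback merely stands in for a blocking network request such as replSetGetStatus. The second method shows one possible alternative (snapshot under the lock, probe without it); it is offered as an assumption for comparison, not as the fix that was applied to this issue.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a replica set monitor; not the MongoDB class.
class MonitorSketch {
public:
    typedef std::function<bool(const std::string&)> Probe;

    explicit MonitorSketch(const std::vector<std::string>& hosts) : _hosts(hosts) {}

    // Pattern from the report: the lock stays held across the blocking probe,
    // so every other caller waits for as long as the remote node takes to answer.
    void checkHoldingLock(const Probe& probe) {
        std::lock_guard<std::mutex> lk(_lock);
        for (const std::string& host : _hosts) {
            if (probe(host)) { _master = host; return; }
        }
    }

    // Assumed alternative: copy the host list under the lock, probe without
    // the lock, then re-acquire it briefly to publish the result.
    void checkWithoutLock(const Probe& probe) {
        std::vector<std::string> snapshot;
        {
            std::lock_guard<std::mutex> lk(_lock);
            snapshot = _hosts;
        }
        std::string found;
        for (const std::string& host : snapshot) {
            if (probe(host)) { found = host; break; }
        }
        if (!found.empty()) {
            std::lock_guard<std::mutex> lk(_lock);
            _master = found;
        }
    }

    // Non-blocking read; returns false if another thread holds the lock.
    bool tryGetMaster(std::string& out) {
        std::unique_lock<std::mutex> lk(_lock, std::try_to_lock);
        if (!lk.owns_lock()) return false;
        out = _master;
        return true;
    }

private:
    std::mutex _lock;
    std::vector<std::string> _hosts;
    std::string _master;
};

// Runs one check whose probe stalls until released, and reports whether a
// concurrent reader found the lock held while the probe was in flight.
bool readerBlockedDuringProbe(bool holdLock) {
    std::vector<std::string> hosts(1, "node1.example:27017");
    MonitorSketch monitor(hosts);
    std::atomic<bool> inProbe(false), release(false);
    MonitorSketch::Probe slowProbe = [&](const std::string&) {
        inProbe = true;                        // the "network call" has started
        while (!release) std::this_thread::yield();
        return true;
    };
    std::thread checker([&] {
        if (holdLock) monitor.checkHoldingLock(slowProbe);
        else monitor.checkWithoutLock(slowProbe);
    });
    while (!inProbe) std::this_thread::yield();
    std::string master;
    bool blocked = !monitor.tryGetMaster(master);
    release = true;                            // let the stalled probe finish
    checker.join();
    return blocked;
}
```

Calling readerBlockedDuringProbe(true) models the situation in this report: while the probe is stalled, any other thread that needs the lock cannot proceed. With readerBlockedDuringProbe(false), the reader is not held up by the stalled probe.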

      Exact steps to reproduce are unclear; however, this was encountered after some combination of the following steps:

      1. Removing a node from a replica set (via rs.reconfig())
      2. Hiding a node (via rs.reconfig())
      3. Unhiding a node (via rs.reconfig())

      Note the following stack traces while mongos was unable to process any command:

      mongod WriteBackCommand, waiting to pop from blocking queue:

      Thread 25 (Thread 0x7f6430558700 (LWP 28047)):
      #0  0x00000032b720b7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
      #1  0x0000000000a39f99 in mongo::WriteBackCommand::run(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj&, int, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, mongo::BSONObjBuilder&, bool) ()
      #2  0x000000000097d994 in mongo::execCommand(mongo::Command*, mongo::Client&, int, char const*, mongo::BSONObj&, mongo::BSONObjBuilder&, bool) ()
      #3  0x000000000097ef8f in mongo::_runCommands(char const*, mongo::BSONObj&, mongo::_BufBuilder<mongo::TrivialAllocator>&, mongo::BSONObjBuilder&, bool, int) ()
      #4  0x00000000009420c5 in mongo::runCommands(char const*, mongo::BSONObj&, mongo::CurOp&, mongo::_BufBuilder<mongo::TrivialAllocator>&, mongo::BSONObjBuilder&, bool, int) ()
      #5  0x0000000000944bf0 in mongo::runQuery(mongo::Message&, mongo::QueryMessage&, mongo::CurOp&, mongo::Message&) ()
      #6  0x0000000000888fd7 in ?? ()
      #7  0x000000000088dbb9 in mongo::assembleResponse(mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&) ()
      #8  0x0000000000aa0b38 in mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*, mongo::LastError*) ()
      #9  0x0000000000638767 in mongo::pms::threadRun(mongo::MessagingPort*) ()
      #10 0x00000032b7207851 in start_thread () from /lib64/libpthread.so.0
      #11 0x00000032b6ee811d in clone () from /lib64/libc.so.6
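
The top frame of the mongod trace above (pthread_cond_timedwait inside WriteBackCommand::run()) is consistent with a timed pop from a blocking queue. A minimal sketch of such a queue follows, assuming a condition-variable design; TimedBlockingQueue and popWithTimeout() are hypothetical names and do not reproduce the MongoDB implementation.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical timed blocking queue; illustrative only.
template <typename T>
class TimedBlockingQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _items.push(std::move(value));
        }
        _ready.notify_one();  // wake one waiting consumer
    }

    // Waits up to `timeout` for an item. Returns false if none arrived,
    // in which case `out` is left untouched.
    bool popWithTimeout(T& out, std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(_mutex);
        bool haveItem = _ready.wait_for(lk, timeout,
                                        [this] { return !_items.empty(); });
        if (!haveItem) return false;
        out = std::move(_items.front());
        _items.pop();
        return true;
    }

private:
    std::mutex _mutex;
    std::condition_variable _ready;
    std::queue<T> _items;
};
```

A consumer blocked in popWithTimeout() is simply idle until a producer pushes an item, which suggests the mongod thread in the trace is waiting for work rather than holding anything up itself.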
      

      mongos WriteBackListener::run(), which has acquired _checkConnectionLock:

      Thread 18 (Thread 0x7f4b83d0b700 (LWP 27781)):
      #0  0x00000032b720e94c in recv () from /lib64/libpthread.so.0
      #1  0x0000000000550803 in mongo::Socket::_recv(char*, int) ()
      #2  0x0000000000550819 in mongo::Socket::unsafe_recv(char*, int) ()
      #3  0x0000000000551cf4 in mongo::Socket::recv(char*, int) ()
      #4  0x0000000000558db6 in mongo::MessagingPort::recv(mongo::Message&) ()
      #5  0x000000000055961b in mongo::MessagingPort::recv(mongo::Message const&, mongo::Message&) ()
      #6  0x0000000000559aa4 in mongo::MessagingPort::call(mongo::Message&, mongo::Message&) ()
      #7  0x000000000057898c in mongo::DBClientConnection::call(mongo::Message&, mongo::Message&, bool, std::basic_string<char, std::char_traits<char>, std::allocator<char> >*) ()
      #8  0x00000000005945fd in mongo::DBClientCursor::init() ()
      #9  0x0000000000567edc in mongo::DBClientBase::query(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::Query, int, int, mongo::BSONObj const*, int, int) ()
      #10 0x000000000057e781 in mongo::DBClientConnection::query(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::Query, int, int, mongo::BSONObj const*, int, int) ()
      #11 0x00000000005760c3 in mongo::DBClientInterface::findN(std::vector<mongo::BSONObj, std::allocator<mongo::BSONObj> >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::Query, int, int, mongo::BSONObj const*, int) ()
      #12 0x0000000000576c12 in mongo::DBClientInterface::findOne(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::Query const&, mongo::BSONObj const*, int) ()
      #13 0x000000000057eafa in mongo::DBClientConnection::runCommand(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj const&, mongo::BSONObj&, int) ()
      #14 0x0000000000584e7d in mongo::ReplicaSetMonitor::_checkStatus(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
      #15 0x0000000000586560 in mongo::ReplicaSetMonitor::_checkConnection(mongo::DBClientConnection*, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, bool, int) ()
      #16 0x00000000005875ce in mongo::ReplicaSetMonitor::_check(bool) ()
      #17 0x00000000005881fe in mongo::ReplicaSetMonitor::getMaster() ()
      #18 0x00000000005884bf in mongo::DBClientReplicaSet::checkMaster() ()
      #19 0x000000000058b1d6 in mongo::DBClientReplicaSet::connect() ()
      #20 0x00000000005790b9 in mongo::ConnectionString::connect(std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, double) const ()
      #21 0x0000000000562972 in mongo::DBConnectionPool::get(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, double) ()
      #22 0x00000000005c3e6c in mongo::ShardConnection::_init() ()
      #23 0x00000000005c43a5 in mongo::ShardConnection::ShardConnection(mongo::Shard const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
      #24 0x0000000000768367 in mongo::Strategy::insert(mongo::Shard const&, char const*, mongo::BSONObj const&, int, bool) ()
      #25 0x000000000076ba74 in mongo::ShardStrategy::_insert(mongo::Request&, mongo::DbMessage&, boost::shared_ptr<mongo::ChunkManager const>) ()
      #26 0x0000000000773514 in mongo::ShardStrategy::writeOp(int, mongo::Request&) ()
      #27 0x00000000007b4b7d in mongo::Request::process(int) ()
      #28 0x00000000007ece27 in mongo::WriteBackListener::run() ()
      #29 0x0000000000524e4f in mongo::BackgroundJob::jobBody(boost::shared_ptr<mongo::BackgroundJob::JobStatus>) ()
      #30 0x00000000005271c4 in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf1<void, mongo::BackgroundJob, boost::shared_ptr<mongo::BackgroundJob::JobStatus> >, boost::_bi::list2<boost::_bi::value<mongo::BackgroundJob*>, boost::_bi::value<boost::shared_ptr<mongo::BackgroundJob::JobStatus> > > > >::run() ()
      #31 0x00000000008053f0 in thread_proxy ()
      #32 0x00000032b7207851 in start_thread () from /lib64/libpthread.so.0
      #33 0x00000032b6ee811d in clone () from /lib64/libc.so.6
      

      mongos balancer thread, waiting to acquire _checkConnectionLock (this is just one of many threads waiting for this lock):

      Thread 23 (Thread 0x7f4b87011700 (LWP 27676)):
      #0  0x00000032b720e054 in __lll_lock_wait () from /lib64/libpthread.so.0
      #1  0x00000032b7209388 in _L_lock_854 () from /lib64/libpthread.so.0
      #2  0x00000032b7209257 in pthread_mutex_lock () from /lib64/libpthread.so.0
      #3  0x000000000058d013 in mongo::mutex::scoped_lock::scoped_lock(mongo::mutex&) ()
      #4  0x0000000000585cf5 in mongo::ReplicaSetMonitor::_checkConnection(mongo::DBClientConnection*, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, bool, int) ()
      #5  0x00000000005875ce in mongo::ReplicaSetMonitor::_check(bool) ()
      #6  0x00000000005881fe in mongo::ReplicaSetMonitor::getMaster() ()
      #7  0x00000000005884bf in mongo::DBClientReplicaSet::checkMaster() ()
      #8  0x000000000058b1d6 in mongo::DBClientReplicaSet::connect() ()
      #9  0x00000000005790b9 in mongo::ConnectionString::connect(std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, double) const ()
      #10 0x0000000000562972 in mongo::DBConnectionPool::get(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, double) ()
      #11 0x0000000000562f62 in mongo::ScopedDbConnection::ScopedDbConnection(mongo::Shard const*, double) ()
      #12 0x0000000000756b15 in mongo::Shard::runCommand(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj const&) const ()
      #13 0x00000000007d9444 in mongo::Balancer::_checkOIDs() ()
      #14 0x00000000007dae61 in mongo::Balancer::run() ()
      #15 0x0000000000524e4f in mongo::BackgroundJob::jobBody(boost::shared_ptr<mongo::BackgroundJob::JobStatus>) ()
      #16 0x00000000005271c4 in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf1<void, mongo::BackgroundJob, boost::shared_ptr<mongo::BackgroundJob::JobStatus> >, boost::_bi::list2<boost::_bi::value<mongo::BackgroundJob*>, boost::_bi::value<boost::shared_ptr<mongo::BackgroundJob::JobStatus> > > > >::run() ()
      #17 0x00000000008053f0 in thread_proxy ()
      #18 0x00000032b7207851 in start_thread () from /lib64/libpthread.so.0
      #19 0x00000032b6ee811d in clone () from /lib64/libc.so.6
      

            Assignee: Ben Becker (benjamin.becker)
            Reporter: Ben Becker (benjamin.becker)
            Votes: 0
            Watchers: 3
