Core Server / SERVER-59478

Move serverStatus command before taking RSTL in catchup_takeover_with_higher_config.js

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 5.0.3, 4.4.9, 5.1.0-rc0
    • Affects Version/s: None
    • Component/s: Replication
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Backport Requested: v5.0, v4.4
    • Sprint: Repl 2021-08-23
    • 48

      When executed, this serverStatus command by default outputs WiredTiger information, which requires taking the GlobalLock (and therefore the RSTL), but the RSTL is already held by the stepup thread, which is hung on a failpoint. Normally, since the global lock acquisition by the serverStatus command has a deadline set to Date_t::now(), the acquisition should fail quickly if it cannot take the RSTL. However, it seems that this acquisition can sometimes block anyway (possibly due to a faulty system clock affecting the lock-waiting implementation), hanging the test:

      #0  0x00007f56a67047e1 in poll () from /lib64/libc.so.6
      #1  0x0000562dd02fc3b3 in mongo::transport::TransportLayerASIO::BatonASIO::run(mongo::ClockSource*) ()
      #2  0x0000562dd02e646d in mongo::transport::TransportLayerASIO::BatonASIO::run_until(mongo::ClockSource*, mongo::Date_t) ()
      #3  0x0000562dd07dc9e1 in mongo::ClockSource::waitForConditionUntil(mongo::stdx::condition_variable&, mongo::BasicLockableAdapter, mongo::Date_t, mongo::Waitable*) ()
      #4  0x0000562dd07d0600 in mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil(mongo::stdx::condition_variable&, mongo::BasicLockableAdapter, mongo::Date_t) ()
      #5  0x0000562dd0783ea5 in mongo::Interruptible::waitForConditionOrInterruptUntil<std::unique_lock<mongo::latch_detail::Latch>, mongo::CondVarLockGrantNotification::wait(mongo::OperationContext*, mongo::Duration<std::ratio<1l, 1000l> >)::{lambda()#1}>(mongo::stdx::condition_variable&, std::unique_lock<mongo::latch_detail::Latch>&, mongo::Date_t, mongo::CondVarLockGrantNotification::wait(mongo::OperationContext*, mongo::Duration<std::ratio<1l, 1000l> >)::{lambda()#1}, mongo::AtomicWord<long>*)::{lambda(auto:1&, mongo::Interruptible::WakeSpeed)#3}::operator()(std::unique_lock<mongo::latch_detail::Latch>&, mongo::AtomicWord<long>*) const ()
      #6  0x0000562dd078453c in mongo::CondVarLockGrantNotification::wait(mongo::OperationContext*, mongo::Duration<std::ratio<1l, 1000l> >) ()
      #7  0x0000562dd07863e6 in mongo::LockerImpl::_lockComplete(mongo::OperationContext*, mongo::ResourceId, mongo::LockMode, mongo::Date_t) ()
      #8  0x0000562dd0778978 in mongo::Lock::GlobalLock::GlobalLock(mongo::OperationContext*, mongo::LockMode, mongo::Date_t, mongo::Lock::InterruptBehavior) ()
      #9  0x0000562dcecc562b in mongo::WiredTigerServerStatusSection::generateSection(mongo::OperationContext*, mongo::BSONElement const&) const ()
      #10 0x0000562dcea63429 in mongo::ServerStatusSection::appendSection(mongo::OperationContext*, mongo::BSONElement const&, mongo::BSONObjBuilder*) const ()
      #11 0x0000562dcf66bfca in mongo::CmdServerStatus::run(mongo::OperationContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj const&, mongo::BSONObjBuilder&) ()
      #12 0x0000562dcf84229a in mongo::BasicCommandWithReplyBuilderInterface::Invocation::run(mongo::OperationContext*, mongo::rpc::ReplyBuilderInterface*) ()
      #13 0x0000562dcf83cabf in mongo::CommandHelpers::runCommandInvocation(mongo::OperationContext*, mongo::OpMsgRequest const&, mongo::CommandInvocation*, mongo::rpc::ReplyBuilderInterface*) ()
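
      The "wiredTiger" section generated in frame #9 above is part of the default serverStatus output. A minimal shell illustration, assuming a mongod running the WiredTiger storage engine (this snippet is not part of the test itself):

      // The default serverStatus output includes the "wiredTiger" section, whose
      // generation acquires the global lock (and therefore the RSTL).
      const res = assert.commandWorked(db.adminCommand({serverStatus: 1}));
      assert(res.hasOwnProperty("wiredTiger"));

      // Any section can be excluded explicitly; omitting "wiredTiger" avoids that
      // lock acquisition altogether.
      const slim = assert.commandWorked(db.adminCommand({serverStatus: 1, wiredTiger: 0}));
      assert(!slim.hasOwnProperty("wiredTiger"));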
      

      To mitigate this, we can move the serverStatus call so it runs before the stepup thread takes the RSTL, avoiding this situation entirely.
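
      A rough sketch of the reordering in the test, with placeholder setup and failpoint name (not the exact code in catchup_takeover_with_higher_config.js):

      load("jstests/libs/fail_point_util.js");

      const rst = new ReplSetTest({nodes: 3});
      rst.startSet();
      rst.initiate();

      const node = rst.getSecondaries()[0];

      // Take the serverStatus sample first, while no stepup thread holds the RSTL,
      // so the WiredTiger section can never block behind it.
      const statusBeforeTakeover = assert.commandWorked(node.adminCommand({serverStatus: 1}));

      // Only then hang the stepup thread on a failpoint (placeholder name) and drive
      // the catchup takeover; the hung thread holds the RSTL from this point on.
      const fp = configureFailPoint(node, "hangDuringStepUpExample");
      // ... trigger the catchup takeover and make assertions using statusBeforeTakeover ...
      fp.off();

      rst.stopSet();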

            Assignee: Wenbin Zhu (wenbin.zhu@mongodb.com)
            Reporter: Wenbin Zhu (wenbin.zhu@mongodb.com)
            Votes: 0
            Watchers: 4
