[SERVER-60016] serverStatus and collection stats should not block on the RSTL lock Created: 16/Sep/21  Updated: 29/Oct/23  Resolved: 01/Mar/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.0.0-rc0, 5.0.14

Type: Task Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: Dianna Hohensee (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
depends on SERVER-63219 Change collStats cmd to run without t... Closed
Problem/Incident
Related
is related to SERVER-64053 Support optionally skipping the RSTL ... Backlog
is related to SERVER-63264 Create a way to have serverStatus out... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v5.0, v4.4, v4.2
Sprint: Execution Team 2021-10-04, Execution Team 2021-10-18, Execution Team 2022-02-07, Execution Team 2022-02-21, Execution Team 2022-03-07
Participants:
Linked BF Score: 35

 Description   

This ticket would explore whether the RSTL lock is necessary for serverStatus and collection stats.

I noticed these cmds in backtraces stuck waiting for the RSTL lock while stepdown hung. I think there are potentially valuable diagnostics we aren't for this sort of hang. 



 Comments   
Comment by Githook User [ 03/Nov/22 ]

Author:

{'name': 'Jordi Olivares Provencio', 'email': 'jordi.olivares-provencio@mongodb.com', 'username': 'jordiolivares'}

Message: SERVER-60016 serverStatus should not block on the RSTL lock with the WT storage engine
Branch: v5.0
https://github.com/mongodb/mongo/commit/9709ef3d8cbb715196c5609339b06793a4551886

Comment by Githook User [ 01/Mar/22 ]

Author:

{'name': 'Dianna Hohensee', 'email': 'dianna.hohensee@mongodb.com', 'username': 'DiannaHohensee'}

Message: SERVER-60016 add requires_replication and requires_sharding tags to serverStatus_does_not_block_on_RSTL.js
Branch: master
https://github.com/mongodb/mongo/commit/9eaa498adf3aba641c079387db9fee56b5d0af9d

Comment by Githook User [ 28/Feb/22 ]

Author:

{'name': 'Dianna Hohensee', 'email': 'dianna.hohensee@mongodb.com', 'username': 'DiannaHohensee'}

Message: SERVER-60016 serverStatus should not block on the RSTL lock with the WT storage engine
Branch: master
https://github.com/mongodb/mongo/commit/0402164de8b0433cc9eb4fd137c96811dc0130db

Comment by Bruce Lucas (Inactive) [ 17/Feb/22 ]

Good news, thanks!

Comment by Dianna Hohensee (Inactive) [ 17/Feb/22 ]

Thanks for the tips bruce.lucas! I ended up being able to either remove locks entirely or use GlobalLock w/o the RSTL lock  So serverStatus shouldn't need the RSTL lock anymore. Though I've filed SERVER-63264 as a followup to create some way of ensuring my JS test doesn't become stale when new non-default serverStatus sections are added in future.

Comment by Bruce Lucas (Inactive) [ 03/Feb/22 ]

A word of caution about using lock acquisition deadlines in FTDC: if lock acquisition times out intermittently but frequently, it can cause frequent schema changes in FTDC as sections are omitted and then re-introduced. If it happens frequently enough that can decrease compression efficiency of FTDC, which can limit FTDC retention and impact diagnosability.

I believe brad.devlugt recently added performance tests that hopefully would detect that if it happens in more-or-less normal workloads. SERVER-58026 has some details of a case where this became (and remains) an issue.

Comment by Dianna Hohensee (Inactive) [ 02/Feb/22 ]

Copying the serverStatus stack trace stuck on RSTL acquisition.

Thread 904 (Thread 0x7ff60a684700 (LWP 78577)):
#0  0x00007ff663a47c3d in poll () from /lib64/libc.so.6
#1  0x000055777baacb2f in mongo::transport::TransportLayerASIO::BatonASIO::run(mongo::ClockSource*) ()
#2  0x000055777ba9644d in mongo::transport::TransportLayerASIO::BatonASIO::run_until(mongo::ClockSource*, mongo::Date_t) ()
#3  0x000055777c0f3f81 in mongo::ClockSource::waitForConditionUntil(mongo::stdx::condition_variable&, mongo::BasicLockableAdapter, mongo::Date_t, mongo::Waitable*) ()
#4  0x000055777c0e7075 in mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil(mongo::stdx::condition_variable&, mongo::BasicLockableAdapter, mongo::Date_t) ()
#5  0x000055777b07da62 in _ZZN5mongo13Interruptible32waitForConditionOrInterruptUntilISt11unique_lockINS_12latch_detail5LatchEEZNS_28CondVarLockGrantNotification4waitEPNS_16OperationContextENS_8DurationISt5ratioILl1ELl1000EEEEEUlvE_EEbRNS_4stdx18condition_variableERT_NS_6Date_tET0_PNS_10AtomicWordIlEEENKUlSJ_NS0_9WakeSpeedEE3_clESJ_SO_ ()
#6  0x000055777b07dedf in bool mongo::Interruptible::waitForConditionOrInterruptUntil<std::unique_lock<mongo::latch_detail::Latch>, mongo::CondVarLockGrantNotification::wait(mongo::OperationContext*, mongo::Duration<std::ratio<1l, 1000l> >)::{lambda()#1}>(mongo::stdx::condition_variable&, std::unique_lock<mongo::latch_detail::Latch>&, mongo::Date_t, mongo::CondVarLockGrantNotification::wait(mongo::OperationContext*, mongo::Duration<std::ratio<1l, 1000l> >)::{lambda()#1}, mongo::AtomicWord<long>*) [clone .constprop.692] ()
#7  0x000055777b07e17b in mongo::CondVarLockGrantNotification::wait(mongo::OperationContext*, mongo::Duration<std::ratio<1l, 1000l> >) ()
#8  0x000055777b080a86 in mongo::LockerImpl::_lockComplete(mongo::OperationContext*, mongo::ResourceId, mongo::LockMode, mongo::Date_t) ()
#9  0x000055777b072e78 in mongo::Lock::GlobalLock::_takeGlobalAndRSTLLocks(mongo::LockMode, mongo::Date_t) ()
#10 0x000055777b07336e in mongo::Lock::GlobalLock::GlobalLock(mongo::OperationContext*, mongo::LockMode, mongo::Date_t, mongo::Lock::InterruptBehavior, bool) ()
#11 0x000055777ace3407 in mongo::AutoGetOplog::AutoGetOplog(mongo::OperationContext*, mongo::OplogAccessMode, mongo::Date_t) ()
#12 0x000055777997e8d3 in mongo::repl::(anonymous namespace)::OplogInfoServerStatus::generateSection(mongo::OperationContext*, mongo::BSONElement const&) const ()
#13 0x00005577798e4a99 in mongo::ServerStatusSection::appendSection(mongo::OperationContext*, mongo::BSONElement const&, mongo::BSONObjBuilder*) const ()
#14 0x000055777b096e01 in mongo::CmdServerStatus::run(mongo::OperationContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj const&, mongo::BSONObjBuilder&) ()
#15 0x0000557779960c38 in mongo::BasicCommand::runWithReplyBuilder(mongo::OperationContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj const&, mongo::rpc::ReplyBuilderInterface*) ()
#16 0x000055777b0a88ce in mongo::BasicCommandWithReplyBuilderInterface::Invocation::run(mongo::OperationContext*, mongo::rpc::ReplyBuilderInterface*) ()
#17 0x000055777b0a14ef in mongo::CommandHelpers::runCommandInvocation(mongo::OperationContext*, mongo::OpMsgRequest const&, mongo::CommandInvocation*, mongo::rpc::ReplyBuilderInterface*) ()
#18 0x000055777b0a677e in mongo::CommandHelpers::runCommandInvocation(std::shared_ptr<mongo::RequestExecutionContext>, std::shared_ptr<mongo::CommandInvocation>, mongo::transport::ServiceExecutor::ThreadingModel) ()
#19 0x00005577798beb87 in mongo::(anonymous namespace)::runCommandInvocation(std::shared_ptr<mongo::RequestExecutionContext>, std::shared_ptr<mongo::CommandInvocation>) ()
#20 0x00005577798d1116 in mongo::(anonymous namespace)::RunCommandImpl::_runCommand() ()
#21 0x00005577798d214b in mongo::(anonymous namespace)::RunCommandImpl::_runImpl() ()
#22 0x00005577798c30ec in mongo::(anonymous namespace)::RunCommandImpl::run() ()
#23 0x00005577798c52fe in mongo::(anonymous namespace)::ExecCommandDatabase::_commandExec() ()
#24 0x00005577798cdde0 in mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#2}::operator()() ()
#25 0x00005577798cffd0 in _ZZN5mongo15unique_functionIFvPNS_14future_details15SharedStateBaseEEE8makeImplIZNS1_10FutureImplINS1_8FakeVoidEE16makeContinuationIvZZNOS9_4thenIZNS_12_GLOBAL__N_114executeCommandESt10shared_ptrINSC_13HandleRequest16ExecutionContextEEEUlvE0_EEDaOT_ENKUlvE1_clEvEUlPNS1_15SharedStateImplIS8_EESN_E_EENS7_ISI_EEOT0_EUlS3_E_EEDaSJ_EN12SpecificImpl4callEOS3_ ()
#26 0x000055777988e487 in mongo::future_details::SharedStateBase::transitionToFinished() ()
#27 0x00005577798d43b2 in _ZN5mongo14future_details10FutureImplINS0_8FakeVoidEE11generalImplIZNOS3_17propagateResultToEPNS0_15SharedStateImplIS2_EEEUlOS2_E_ZNOS3_17propagateResultToES7_EUlONS_6StatusEE0_ZNOS3_17propagateResultToES7_EUlvE1_EEDaOT_OT0_OT1_ ()
#28 0x00005577798c779c in _ZZN5mongo15unique_functionIFvPNS_14future_details15SharedStateBaseEEE8makeImplIZNS1_10FutureImplINS1_8FakeVoidEE16makeContinuationIvZZNOS9_4thenIZNS_12_GLOBAL__N_114executeCommandESt10shared_ptrINSC_13HandleRequest16ExecutionContextEEEUlvE_EEDaOT_ENKUlvE1_clEvEUlPNS1_15SharedStateImplIS8_EESN_E_EENS7_ISI_EEOT0_EUlS3_E_EEDaSJ_EN12SpecificImpl4callEOS3_ ()
#29 0x000055777988e487 in mongo::future_details::SharedStateBase::transitionToFinished() ()
#30 0x00005577798ce7ca in mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>) ()
#31 0x00005577798cf1fe in mongo::(anonymous namespace)::receivedCommands(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>) ()
#32 0x00005577798cff16 in mongo::(anonymous namespace)::CommandOpRunner::run() ()
#33 0x00005577798ca187 in mongo::ServiceEntryPointCommon::handleRequest(mongo::OperationContext*, mongo::Message const&, std::unique_ptr<mongo::ServiceEntryPointCommon::Hooks const, std::default_delete<mongo::ServiceEntryPointCommon::Hooks const> >) ()
#34 0x00005577798ba1cf in mongo::ServiceEntryPointMongod::handleRequest(mongo::OperationContext*, mongo::Message const&) ()
#35 0x0000557779912107 in mongo::transport::ServiceStateMachine::Impl::processMessage() ()
#36 0x0000557779912437 in _ZZNO5mongo14future_details10FutureImplINS0_8FakeVoidEE4thenIZNS_9transport19ServiceStateMachine4Impl12startNewLoopERKNS_6StatusEEUlvE0_EEDaOT_ENKUlOS2_E_clESE_.isra.718 ()
#37 0x0000557779914d68 in mongo::transport::ServiceStateMachine::Impl::startNewLoop(mongo::Status const&) ()
#38 0x000055777991526f in auto mongo::unique_function<void (mongo::Status)>::makeImpl<mongo::transport::ServiceStateMachine::Impl::scheduleNewLoop(mongo::Status)::{lambda(mongo::Status)#1}>(mongo::transport::ServiceStateMachine::Impl::scheduleNewLoop(mongo::Status)::{lambda(mongo::Status)#1}&&)::SpecificImpl::call(mongo::Status&&) ()
#39 0x000055777ba75e01 in auto mongo::unique_function<void (mongo::Status)>::makeImpl<mongo::transport::ServiceExecutorSynchronous::runOnDataAvailable(std::shared_ptr<mongo::transport::Session> const&, mongo::unique_function<void (mongo::Status)>)::{lambda(mongo::Status)#1}>(mongo::transport::ServiceExecutorSynchronous::runOnDataAvailable(std::shared_ptr<mongo::transport::Session> const&, mongo::unique_function<void (mongo::Status)>)::{lambda(mongo::Status)#1}&&)::SpecificImpl::call(mongo::Status&&) ()
#40 0x0000557779e3ef93 in auto mongo::unique_function<void ()>::makeImpl<mongo::transport::ServiceExecutor::schedule(mongo::unique_function<void (mongo::Status)>)::{lambda()#1}>(mongo::transport::ServiceExecutor::schedule(mongo::unique_function<void (mongo::Status)>)::{lambda()#1}&&)::SpecificImpl::call() ()
#41 0x000055777ba75fbb in auto mongo::unique_function<void ()>::makeImpl<mongo::transport::ServiceExecutorSynchronous::scheduleTask(mongo::unique_function<void ()>, mongo::transport::ServiceExecutor::ScheduleFlags)::{lambda()#2}>(mongo::transport::ServiceExecutorSynchronous::scheduleTask(mongo::unique_function<void ()>, mongo::transport::ServiceExecutor::ScheduleFlags)::{lambda()#2}&&)::SpecificImpl::call() ()
#42 0x000055777ba79e6c in auto mongo::unique_function<void ()>::makeImpl<mongo::launchServiceWorkerThread(mongo::unique_function<void ()>)::{lambda()#4}>(mongo::launchServiceWorkerThread(mongo::unique_function<void ()>)::{lambda()#4}&&)::SpecificImpl::call() ()
#43 0x000055777ba79edc in mongo::(anonymous namespace)::runFunc(void*) ()
#44 0x00007ff663d29ea5 in start_thread () from /lib64/libpthread.so.0
#45 0x00007ff663a5296d in clone () from /lib64/libc.so.6

Takes RSTL in OplogInfoServerStatus section.

Other ServerStatusSection implementations may acquire locks. We'd need to 1) find them 2) invariant against later additions of locks or 3) use lock acquisition deadlines (maybe way to abort on RSTL lock acquisition failure?) and move onto the next section that we can report on.

Comment by Connie Chen [ 16/Sep/21 ]

We should also review if this is related to RSTL lock fix for profiling.

Generated at Thu Feb 08 05:48:44 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.