[SERVER-52688] db.serverStatus() hang on SECONDARY server Created: 09/Nov/20  Updated: 23/Sep/21  Resolved: 04/Dec/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Peter Volkov Assignee: Edwin Zhou
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File gdb_2020-11-09_15-58-04.txt.xz     File gdb_2020-11-09_16-34-39.txt.xz     File gdb_202109-23_11-07-08.txt.xz     File mongodb.log-20201111.xz    
Issue Links:
Related
is related to SERVER-60164 db.serverStatus() hang on SECONDARY s... Closed
Operating System: ALL
Steps To Reproduce:

I don't know. We just found the server in this state 3 days ago.

Participants:

 Description   

Whenever I run `mongo --host mongo3:27017 --eval 'db.serverStatus()'`, the command hangs for a very long time (I have never managed to get a reply).
mongo3 is a SECONDARY server in a MongoDB replica set. On the PRIMARY and the other SECONDARIES, everything works as expected.
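
For what it's worth, a bounded variant of the same probe that could be tried next time (this is only a guess on my part that maxTimeMS would make the command fail with an error instead of hanging indefinitely; I have not verified it on the affected node):

# Same serverStatus call, but with a 5-second server-side time limit (maxTimeMS).
# Assumption: on the stuck node this returns a MaxTimeMSExpired error rather than
# hanging, which at least confirms the request reaches the server.
mongo --host mongo3:27017 --eval 'printjson(db.adminCommand({serverStatus: 1, maxTimeMS: 5000}))'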

The output of the following command is attached (gdb_2020-11-09_15-58-04.txt.xz):

gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' > gdb_`date +"%Y-%m-%d_%H-%M-%S"`.txt

 

This is Gentoo Linux with MongoDB 4.2.10 installed.

Another observation: I tried to stop mongod, and it hung with the following in the logs:

2020-11-09T16:32:51.302+0300 I CONTROL [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2020-11-09T16:32:51.304+0300 I REPL [signalProcessingThread] Stepping down the ReplicationCoordinator for shutdown, waitTime: 10000ms
2020-11-09T16:32:51.309+0300 I SHARDING [signalProcessingThread] Shutting down the WaitForMajorityService
2020-11-09T16:32:51.313+0300 I CONTROL [signalProcessingThread] Shutting down the LogicalSessionCache
2020-11-09T16:32:51.316+0300 I CONTROL [LogicalSessionCacheRefresh] Failed to refresh session cache: ClientMarkedKilled: client has been killed
2020-11-09T16:32:51.320+0300 I CONTROL [LogicalSessionCacheReap] Failed to reap transaction table: ClientMarkedKilled: client has been killed
2020-11-09T16:32:51.321+0300 I NETWORK [signalProcessingThread] shutdown: going to close listening sockets...
2020-11-09T16:32:51.321+0300 I NETWORK [listener] removing socket file: /tmp/mongodb-27017.sock
2020-11-09T16:32:51.321+0300 I NETWORK [signalProcessingThread] Shutting down the global connection pool
2020-11-09T16:32:51.321+0300 I STORAGE [signalProcessingThread] Shutting down the FlowControlTicketholder
2020-11-09T16:32:51.321+0300 I - [signalProcessingThread] Stopping further Flow Control ticket acquisitions.
2020-11-09T16:32:51.321+0300 I STORAGE [signalProcessingThread] Shutting down the PeriodicThreadToAbortExpiredTransactions
2020-11-09T16:32:51.321+0300 I STORAGE [signalProcessingThread] Shutting down the PeriodicThreadToDecreaseSnapshotHistoryIfNotNeeded
2020-11-09T16:32:51.322+0300 I REPL [signalProcessingThread] Shutting down the ReplicationCoordinator
2020-11-09T16:32:51.322+0300 I REPL [signalProcessingThread] shutting down replication subsystems
2020-11-09T16:32:51.322+0300 I REPL [signalProcessingThread] Stopping replication reporter thread
2020-11-09T16:32:51.322+0300 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to mongo1:27017: CallbackCanceled: Reporter no longer valid
2020-11-09T16:32:51.323+0300 I REPL [signalProcessingThread] Stopping replication fetcher thread
2020-11-09T16:32:51.324+0300 I REPL [signalProcessingThread] Stopping replication applier thread
2020-11-09T16:32:51.507+0300 I REPL [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source.
2020-11-09T16:32:51.508+0300 I REPL [rsBackgroundSync] Stopping replication producer

 

I ran the gdb command (mentioned above) again; the output is in gdb_2020-11-09_16-34-39.txt.xz.

After I killed the server with -9 and started it again, db.serverStatus() started working.

 



 Comments   
Comment by Peter Volkov [ 23/Sep/21 ]

Ah, I see there is no way to reopen this issue, so I've opened SERVER-60164 to see if new debugging information will help.

Comment by Peter Volkov [ 23/Sep/21 ]

It happened again. According to monitoring, this happened around 2021-09-22 12:31:35.
Diagnostic data that should cover the moment when the problem happened: https://disk.yandex.ru/d/IIW42L63fVojuw

A new backtrace, gdb_202109-23_11-07-08.txt.xz, is attached as well.

Comment by Edwin Zhou [ 04/Dec/20 ]

Thank you for the update. Given that we lack adequate information to diagnose this issue, I'll close this as Incomplete.

Edwin

Comment by Peter Volkov [ 04/Dec/20 ]

Nope, this happened only once. Of course, it's possible that this was caused by cosmic radiation, but it happened just after the upgrade, so I decided to open the report. Thank you for your investigation. If this happens again, I'll provide more information.

Comment by Edwin Zhou [ 04/Dec/20 ]

Hi peter.volkov@gmail.com,

We'd like to know if this issue has repeated. Would you please inform us whether this is a recurring problem or whether it has occurred on a different machine?

Thanks,
Edwin

Comment by Edwin Zhou [ 19/Nov/20 ]

Hi peter.volkov@gmail.com,

After taking a closer look at the gdb you provided, we're unable to determine what caused all of the threads to be stuck waiting to acquire the global lock. At the moment, we don't have enough information.

Has this problem ever occurred on a different machine? Is this a recurring problem?

Best,

Edwin

Comment by Edwin Zhou [ 16/Nov/20 ]

After looking into the gdb output, we were able to find stack traces that confirm the reported behavior in both captures. However, we still don't know what caused the issue.

The first gdb capture (gdb_2020-11-09_15-58-04.txt) corresponds to the deadlock when running db.serverStatus(): we can see 337 threads blocked in serverStatus, waiting to acquire the global lock (see the sketch after the trace for a gdb-free cross-check).

337.000   │      ▽ mongo::ServiceStateMachine::_sourceCallback
337.000   │       ▽ mongo::ServiceStateMachine::_scheduleNextWithGuard
337.000   │        ▽ mongo::transport::ServiceExecutorSynchronous::schedule
337.000   │         ▽ std::_Function_handler<...>
337.000   │          ▽ mongo::ServiceStateMachine::_runNextInGuard
337.000   │           ▽ mongo::ServiceStateMachine::_processMessage
337.000   │            ▽ mongo::ServiceEntryPointMongod::handleRequest
337.000   │             ▽ mongo::ServiceEntryPointCommon::handleRequest
337.000   │              ▽ mongo::(anonymous namespace)::receivedCommands
337.000   │               ▽ mongo::(anonymous namespace)::execCommandDatabase
337.000   │                ▽ mongo::(anonymous namespace)::runCommandImpl
337.000   │                 ▽ mongo::BasicCommand::Invocation::run
337.000   │                  ▽ mongo::CmdServerStatus::run
337.000   │                   ▽ mongo::WiredTigerServerStatusSection::generateSection
337.000   │                    ▽ mongo::Lock::GlobalLock::GlobalLock
337.000   │                     ▽ mongo::Lock::GlobalLock::GlobalLock
337.000   │                      ▽ mongo::Lock::GlobalLock::_enqueue
337.000   │                       ▽ mongo::Lock::ResourceLock::lock
337.000   │                        ▽ mongo::LockerImpl::lockComplete
337.000   │                         ▽ mongo::CondVarLockGrantNotification::wait
337.000   │                          ▽ mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil
337.000   │                           ▽ mongo::ClockSource::waitForConditionUntil
337.000   │                            ▽ std::cv_status mongo::Waitable::wait_until<...>
337.000   │                             ▽ mongo::transport::TransportLayerASIO::BatonASIO::run_until
337.000   │                              ▽ mongo::transport::TransportLayerASIO::BatonASIO::run
337.000   │                               ▽ poll
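
If this recurs, a shell-level cross-check might help narrow things down without gdb. This is only a sketch: it assumes currentOp is not itself blocked by the same lock, and that the wiredTiger section is the one that needs the global lock.

# Count operations that report themselves as waiting for a lock.
mongo --host mongo3:27017 --eval 'print(db.currentOp({waitingForLock: true}).inprog.length)'

# Ask for serverStatus without the wiredTiger section; if this returns while the full
# command hangs, it would point at WiredTigerServerStatusSection::generateSection.
mongo --host mongo3:27017 --eval 'printjson(db.adminCommand({serverStatus: 1, wiredTiger: 0, maxTimeMS: 5000}))'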

The second gdb capture (gdb_2020-11-09_16-34-39.txt) was taken while mongod was being shut down with -15; the shutdown gets stuck in _stopDataReplication_inlock, and the server was afterward killed with -9 (see the sketch after the trace).

  1.000   │  ▽ mongo::(anonymous namespace)::signalProcessingThread
  1.000   │ │ ▽ mongo::shutdown
  1.000   │ │  ▽ mongo::(anonymous namespace)::runTasks
  1.000   │ │   ▽ mongo::(anonymous namespace)::shutdownTask
  1.000   │ │    ▽ mongo::repl::ReplicationCoordinatorImpl::shutdown
  1.000   │ │     ▽ mongo::repl::ReplicationCoordinatorExternalStateImpl::shutdown
  1.000   │ │      ▽ mongo::repl::ReplicationCoordinatorExternalStateImpl::_stopDataReplication_inlock
  1.000   │ │       ▽ mongo::Interruptible::waitForConditionOrInterruptUntil<...>
  1.000   │ │        ▽ std::condition_variable::wait:53
  1.000   │ │         ▽ __gthread_cond_wait:865
  1.000   │ │          ▽ pthread_cond_wait@@GLIBC_2.3.2
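
If the shutdown hangs again, a forced shutdown via the shutdown command might be worth trying before resorting to SIGKILL. This is only a sketch; whether the command can still make progress once shutdownTask is already stuck in _stopDataReplication_inlock is an open question.

# Ask mongod to shut down, skipping the usual step-down checks (force: true).
# This may or may not help here, depending on where the shutdown is actually wedged.
mongo --host mongo3:27017 --eval 'db.adminCommand({shutdown: 1, force: true})'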

Questions:

  1. bruce.lucas, are we able to find out what is failing to release the lock that prevents serverStatus from completing?
Comment by Edwin Zhou [ 16/Nov/20 ]

Hi peter.volkov@gmail.com,

Unfortunately, the attached diagnostic data does not cover the period in which the issue occurred (the issue occurred on 11/09, while the diagnostics start on 11/10), and there is no way to recover that lost data.

However, we are glad to hear that the issue went away when you restarted mongod. If it occurs again, please attach the archived diagnostic.data, the gdb output, and the logs once more so we can further investigate the hang on db.serverStatus().

Best

Edwin

Comment by Peter Volkov [ 15/Nov/20 ]

The log file is attached. As for diagnostic.data, it looks like it is too large to attach, so I've put it here: https://yadi.sk/d/zY2ltxD8td9yrQ

Comment by Edwin Zhou [ 13/Nov/20 ]

Hi peter.volkov@gmail.com,

Would you please archive (tar or zip) the $dbpath/diagnostic.data directory (the contents are described here) and attach it to this ticket? Could you also attach the log files to this ticket?
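
For example, something along these lines should work (a sketch only; the paths below are assumptions, so please substitute the dbPath and log path configured on mongo3):

# Archive the diagnostic.data directory; replace /var/lib/mongodb with your configured dbPath.
tar -czvf mongo3-diagnostic.data.tar.gz -C /var/lib/mongodb diagnostic.data
# Archive the rotated mongod logs; replace the path with your configured log location.
tar -czvf mongo3-logs.tar.gz /var/log/mongodb/mongodb.log*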

Kind regards,
Edwin

Generated at Thu Feb 08 05:28:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.