[SERVER-60164] db.serverStatus() hang on SECONDARY server #secondtime Created: 23/Sep/21  Updated: 13/Oct/21  Resolved: 13/Oct/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Peter Volkov Assignee: Edwin Zhou
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2021-10-12 at 4.34.35 PM.png     File gdb_202109-23_11-07-08.txt.xz    
Issue Links:
Duplicate
duplicates SERVER-56054 Change minThreads value for replicati... Closed
is duplicated by SERVER-60165 db.serverStatus() hang on SECONDARY s... Closed
is duplicated by SERVER-60166 db.serverStatus() hang on SECONDARY s... Closed
Related
related to SERVER-52688 db.serverStatus() hang on SECONDARY s... Closed
is related to SERVER-47554 Replica Set member suddenly stopped r... Closed
is related to SERVER-54805 Mongo become unresponsive, Spike in C... Closed
Operating System: ALL
Steps To Reproduce:

I have no idea how to reproduce this problem, but it happens occasionally.

This is again dev-db/mongodb-4.2.10 on Gentoo. Yes, we need to upgrade, but I haven't seen any fixes for this specific problem.

Participants:

 Description   

This is a continuation of SERVER-52688. I'm unable to reopen the issue, so I'm opening a new one.

The problem is the same: `mongo --eval 'db.serverStatus()'` hangs forever on a SECONDARY server. The problem appeared around 2021-09-22 12:31:35.

Diagnostic data that should cover the moment when the problem happened: https://disk.yandex.ru/d/IIW42L63fVojuw

Attached gdb_202109-23_11-07-08.txt.xz is the output of the following command:

gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' > gdb_`date +"%Y%m-%d_%H-%M-%S"`.txt
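When a trace like this covers hundreds of threads, it helps to collapse the per-thread backtraces into counted unique stacks (the `count frame;frame;…` shape quoted in the analysis on this ticket). A sketch of such a helper (hypothetical, not part of the ticket; it assumes simplified frame lines starting with `#` — real gdb frames also carry addresses, which a fuller version would strip):

```shell
# Hypothetical helper: collapse 'thread apply all bt' output into counted
# unique stacks, most frequent first.
collapse_stacks() {
    awk '
        /^Thread / { if (stack != "") print stack; stack = "" }   # new thread: flush the previous stack
        /^#/       { sub(/^#[0-9]+ +/, "")                        # drop the frame number
                     stack = (stack == "") ? $0 : stack ";" $0 }
        END        { if (stack != "") print stack }
    ' "$@" | sort | uniq -c | sort -rn
}
# Example: collapse_stacks gdb_202109-23_11-07-08.txt
```

Identical blocked threads (e.g. 17 idle workers) then show up as a single counted line instead of 17 repeated backtraces.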

 Comments   
Comment by Edwin Zhou [ 13/Oct/21 ]

Hi peter.volkov@gmail.com,

Thank you for following up, providing us with detailed diagnostic data, and for your patience while we investigated this issue. I can confirm that the behavior you're seeing is the same as in the previous ticket you opened, SERVER-52688. The gdb output you provided shows the following symptoms:

  1. The oplog buffer is full (OplogBufferBlockingQueue::waitForSpace)

    1 0;clone;start_thread;std::execute_native_thread_routine:80;std::thread::_State_impl<...>;mongo::ThreadPool::_workerThreadBody;mongo::ThreadPool::_consumeTasks;mongo::ThreadPool::_doOneTask;mongo::unique_function<...>;mongo::executor::ThreadPoolTaskExecutor::runCallback;mongo::unique_function<...>;std::_Function_handler<...>;mongo::Fetcher::_callback;mongo::repl::AbstractOplogFetcher::_callback;mongo::repl::OplogFetcher::_onSuccessfulBatch;std::_Function_handler<...>;mongo::repl::BackgroundSync::_enqueueDocuments;mongo::repl::OplogBufferBlockingQueue::waitForSpace;std::condition_variable::wait:53;__gthread_cond_wait:865;pthread_cond_wait@@GLIBC_2.3.2
    

  2. The oplog applier is active but waiting for worker threads to finish (SyncTail::multiApply;mongo::ThreadPool::waitForIdle)

    1 0;clone;start_thread;std::execute_native_thread_routine:80;std::thread::_State_impl<...>;mongo::ThreadPool::_workerThreadBody;mongo::ThreadPool::_consumeTasks;mongo::ThreadPool::_doOneTask;mongo::unique_function<...>;mongo::executor::ThreadPoolTaskExecutor::runCallback;mongo::unique_function<...>;mongo::repl::OplogApplierImpl::_run;mongo::repl::SyncTail::oplogApplication;mongo::repl::SyncTail::_oplogApplication;mongo::repl::SyncTail::multiApply;mongo::ThreadPool::waitForIdle;std::condition_variable::wait:53;__gthread_cond_wait:865;pthread_cond_wait@@GLIBC_2.3.2
    

  3. And its worker threads were waiting for work

    17 0;clone;start_thread;std::execute_native_thread_routine:80;std::thread::_State_impl<...>;mongo::ThreadPool::_workerThreadBody;mongo::ThreadPool::_consumeTasks;std::condition_variable::wait:53;__gthread_cond_wait:865;pthread_cond_wait@@GLIBC_2.3.2
    

Given these symptoms and the diagnostic data you provided, we were able to track this down to SERVER-47554, where we suspected that something specific to the OS or the libc binary was triggering this behavior. You can help verify that this is the same issue by providing your exact OS distro and version, and your glibc version. SERVER-56054 alleviates this behavior, which is caused by a bug present in glibc 2.27. The fix is included in MongoDB 4.2.15, and I recommend upgrading to the latest version of MongoDB 4.2.
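The three stacks above describe one pipeline: a fetcher blocked because the bounded oplog buffer is full, an applier blocked waiting for its workers to go idle, and workers that are already idle. A minimal sketch of that pattern (illustrative Python, not MongoDB code; names like `waitForSpace`/`waitForIdle` are mirrored only in comments):

```python
import threading
import queue

# Pipeline roles, as a sketch:
#   fetcher -> bounded buffer (put() blocks when full, like waitForSpace)
#   applier -> hands an op to workers, then waits for idle (like waitForIdle)
#   workers -> apply ops and signal the applier via a condition variable

N_OPS, N_WORKERS = 16, 2
buf = queue.Queue(maxsize=4)          # OplogBufferBlockingQueue analogue
work_q = queue.Queue()
applied = []
idle = threading.Condition()
outstanding = 0

def worker():
    global outstanding
    while True:
        op = work_q.get()
        if op is None:                # shutdown sentinel
            return
        applied.append(op)
        with idle:
            outstanding -= 1
            idle.notify_all()         # if this wakeup were lost (the glibc 2.27
                                      # bug), the applier would hang in wait_for,
                                      # the buffer would fill, and the fetcher
                                      # would block: the three stacks above

def applier():
    global outstanding
    for _ in range(N_OPS):
        op = buf.get()
        with idle:
            outstanding += 1
        work_q.put(op)
        with idle:                    # "waitForIdle": wait for the batch to finish
            idle.wait_for(lambda: outstanding == 0)

def fetcher():
    for op in range(N_OPS):
        buf.put(op)                   # "waitForSpace": blocks while the buffer is full

threads = [threading.Thread(target=t) for t in (fetcher, applier)]
workers = [threading.Thread(target=worker) for _ in range(N_WORKERS)]
for t in threads + workers:
    t.start()
for t in threads:
    t.join()
for _ in workers:
    work_q.put(None)
for w in workers:
    w.join()
```

The point of the sketch is the dependency chain: nothing in the fetcher or the buffer is broken; a single missed wakeup at the workers' `notify_all` is enough to stall every stage upstream of it.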

I will close this issue as a duplicate of SERVER-56054.

Best,
Edwin

Comment by Peter Volkov [ 24/Sep/21 ]

OK, I've uploaded both files, diagnostic.data.tar.xz and mongodb.log.xz, to the support uploader. I don't see any references here, but I hope that's expected. There was `"parent": {"id": "146396515042"}` in both uploads.

Comment by Eric Sedor [ 23/Sep/21 ]

Hi peter.volkov@gmail.com, thank you very much for writing back with the additional information. Sorry for the Jira inconveniences.

Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory covering the incident and time of gdb stack trace and upload them to this support uploader location?

This is preferable to us accessing an external link for obtaining files.
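A minimal sketch of that archiving step (the paths are assumptions; substitute your actual mongod.log location and `$dbpath`):

```shell
# Hypothetical helper: bundle the mongod logs and the dbpath's
# diagnostic.data directory into one tarball for the support uploader.
archive_incident() {
    logdir=$1; dbpath=$2
    out="incident-$(date +%Y%m%d).tar.gz"
    tar -czf "$out" -C "$logdir" . -C "$dbpath" diagnostic.data
    echo "$out"
}
# Example: archive_incident /var/log/mongodb /var/lib/mongodb
```

Using `-C` keeps the archive free of long absolute paths while still capturing both the logs and the full diagnostic.data directory.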

Gratefully,
Eric

Generated at Thu Feb 08 05:49:07 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.