[SERVER-56784] The replication thread of a secondary hangs Created: 10/May/21  Updated: 27/Oct/23  Resolved: 16/May/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.0.9, 4.0.19
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: FirstName lipengchong Assignee: Dmitry Agranat
Resolution: Community Answered Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File ps1     HTML File ps2    
Issue Links:
Related
related to SERVER-56054 Change minThreads value for replicati... Closed
Operating System: ALL
Participants:

 Description   

Recently we encountered a strange phenomenon on some MongoDB 4.0 sharded clusters: replication on a secondary hangs, so the lag between the primary and the secondary grows very large.

I have collected pstack output from the mongod process.

From the stacks we can see that the 16 replWriterThread workers are waiting for tasks, meaning they are idle:
```
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x5580fc7dd458) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x5580fc7dd400, cond=0x5580fc7dd430) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x5580fc7dd430, mutex=0x5580fc7dd400) at pthread_cond_wait.c:655
#3  0x00005580f5f7ceec in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
#4  0x00005580f5632750 in mongo::ThreadPool::_consumeTasks() ()
#5  0x00005580f5632e86 in mongo::ThreadPool::_workerThreadBody(mongo::ThreadPool*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#6  0x00005580f56331be in std::thread::_Impl<std::_Bind_simple<mongo::stdx::thread::thread<mongo::ThreadPool::_startWorkerThread_inlock()::{lambda()#1}, , 0>(mongo::ThreadPool::_startWorkerThread_inlock()::{lambda()#1})::{lambda()#1} ()> >::_M_run() ()
#7  0x00005580f5f7ff60 in execute_native_thread_routine ()
#8  0x00007fd5151a2fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#9  0x00007fd5150d14cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```
 
But the batcher thread is blocked in ThreadPool::waitForIdle(), waiting for the replication writer threads to become idle:
```
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x5580fc7dd48c) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x5580fc7dd400, cond=0x5580fc7dd460) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x5580fc7dd460, mutex=0x5580fc7dd400) at pthread_cond_wait.c:655
#3  0x00005580f5f7ceec in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
#4  0x00005580f56311bb in mongo::ThreadPool::waitForIdle() ()
#5  0x00005580f4816d91 in mongo::repl::SyncTail::multiApply(mongo::OperationContext*, std::vector<mongo::repl::OplogEntry, std::allocator<mongo::repl::OplogEntry> >) ()
#6  0x00005580f48186e3 in mongo::repl::SyncTail::_oplogApplication(mongo::repl::OplogBuffer*, mongo::repl::ReplicationCoordinator*, mongo::repl::SyncTail::OpQueueBatcher*) ()
#7  0x00005580f48198c3 in mongo::repl::SyncTail::oplogApplication(mongo::repl::OplogBuffer*, mongo::repl::ReplicationCoordinator*) ()
```
 
So I suspect there is a bug here, but I have not been able to find its root cause.
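To illustrate what the two stacks imply, below is a minimal, self-contained C++ sketch (my own simplification, not MongoDB's actual mongo::ThreadPool) of the same pattern: worker threads park on a condition variable waiting for tasks, while a coordinator thread blocks in waitForIdle() until the queue drains. If a condition-variable wakeup is lost on either side, the corresponding thread stays parked even though the predicate it waits for is already true, which matches the hang observed here.

```
// Minimal sketch of the pattern visible in the stacks above (not MongoDB code).
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class SimplePool {
public:
    explicit SimplePool(int numWorkers) {
        for (int i = 0; i < numWorkers; ++i)
            _workers.emplace_back([this] { _workerBody(); });
    }

    ~SimplePool() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _shutdown = true;
        }
        _taskAvailable.notify_all();
        for (auto& w : _workers) w.join();
    }

    void schedule(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _tasks.push(std::move(task));
        }
        // If this wakeup were lost, an idle worker would stay parked forever.
        _taskAvailable.notify_one();
    }

    // Blocks until every scheduled task has finished, analogous to the
    // ThreadPool::waitForIdle() frame in the batcher thread's stack.
    void waitForIdle() {
        std::unique_lock<std::mutex> lk(_mutex);
        _idle.wait(lk, [this] { return _tasks.empty() && _active == 0; });
    }

private:
    void _workerBody() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(_mutex);
                // Roughly the position of the idle replWriterThread stacks:
                // parked in std::condition_variable::wait() until work arrives.
                _taskAvailable.wait(lk, [this] { return _shutdown || !_tasks.empty(); });
                if (_shutdown && _tasks.empty()) return;
                task = std::move(_tasks.front());
                _tasks.pop();
                ++_active;
            }
            task();
            {
                std::lock_guard<std::mutex> lk(_mutex);
                --_active;
            }
            _idle.notify_all();  // wakes a coordinator blocked in waitForIdle()
        }
    }

    std::mutex _mutex;
    std::condition_variable _taskAvailable;
    std::condition_variable _idle;
    std::queue<std::function<void()>> _tasks;
    std::vector<std::thread> _workers;
    int _active = 0;
    bool _shutdown = false;
};

int main() {
    SimplePool pool(16);  // same count as the 16 replWriterThreads
    for (int i = 0; i < 100; ++i)
        pool.schedule([i] { volatile int x = i * i; (void)x; });
    pool.waitForIdle();
    std::cout << "all tasks applied\n";
}
```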



 Comments   
Comment by FirstName lipengchong [ 11/May/21 ]

wow,  thanks very much.  @Dima

Comment by Dmitry Agranat [ 10/May/21 ]

Thanks lpc for proactively collecting stack traces and providing the rest of the information. Based on this information, we suspect this issue is related to a glibc bug (which is not related to MongoDB). This behavior has only manifested on systems with glibc versions susceptible to this glibc pthread condition variable bug. In other words, this bug impacts glibc versions >= 2.27, and since your version is 2.28, you are impacted by this issue.
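As a quick cross-check on other hosts, here is a small sketch (not MongoDB code; gnu_get_libc_version() is a standard glibc function) that reports whether the glibc a process is actually running against falls into the >= 2.27 range mentioned above. It only inspects the version string, so it cannot tell whether a patched glibc build is installed.

```
#include <gnu/libc-version.h>
#include <cstdio>

int main() {
    const char* version = gnu_get_libc_version();  // e.g. "2.28" on Debian 10
    int major = 0, minor = 0;
    std::sscanf(version, "%d.%d", &major, &minor);
    // >= 2.27 is the range called out above; this does not prove the bug is present.
    bool inAffectedRange = (major > 2) || (major == 2 && minor >= 27);
    std::printf("glibc %s: %s\n", version,
                inAffectedRange ? "in the >= 2.27 range discussed above"
                                : "older than 2.27");
    return 0;
}
```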

Even though this bug is not related to MongoDB, we have created SERVER-56054 to try to address this issue from the MongoDB side. Backports were requested to earlier MongoDB versions but have not been committed yet. There are currently three ways to address this issue:

  • Wait for SERVER-56054 to be fixed in the MongoDB version you are currently using (I cannot provide an ETA for these backports)
  • Downgrade to an OS with a glibc version older than 2.27
  • Wait for the glibc bug to be fixed

Dima

Comment by FirstName lipengchong [ 10/May/21 ]

It's Debian 10.3; the glibc version is 2.28.

lipengchong@host:~$ uname  -a
Linux host 4.19.0-12-amd64 #1 SMP Debian 4.19.152-1 (2020-10-18) x86_64 GNU/Linux

lipengchong@host:~$ ldd --version 
ldd (Debian GLIBC 2.28-10) 2.28
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

Comment by Dmitry Agranat [ 10/May/21 ]

Hi lpc,

Could you please provide the exact OS version as well as glibc version for the MongoDB server in question?

Thanks,
Dima

Comment by FirstName lipengchong [ 10/May/21 ]

I am sorry that the formatting above came out garbled; here are the stack traces again.

 

From the stacks we can see that the 16 replWriterThread workers are waiting for tasks, meaning they are idle:

#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x5580fc7dd458) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x5580fc7dd400, cond=0x5580fc7dd430) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x5580fc7dd430, mutex=0x5580fc7dd400) at pthread_cond_wait.c:655
#3  0x00005580f5f7ceec in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
#4  0x00005580f5632750 in mongo::ThreadPool::_consumeTasks() ()
#5  0x00005580f5632e86 in mongo::ThreadPool::_workerThreadBody(mongo::ThreadPool*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#6  0x00005580f56331be in std::thread::_Impl<std::_Bind_simple<mongo::stdx::thread::thread<mongo::ThreadPool::_startWorkerThread_inlock()::{lambda()#1}, , 0>(mongo::ThreadPool::_startWorkerThread_inlock()::{lambda()#1})::{lambda()#1} ()> >::_M_run() ()
#7  0x00005580f5f7ff60 in execute_native_thread_routine ()
#8  0x00007fd5151a2fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#9  0x00007fd5150d14cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

But the batcher thread is blocked in ThreadPool::waitForIdle(), waiting for the replication writer threads to become idle:

#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x5580fc7dd48c) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x5580fc7dd400, cond=0x5580fc7dd460) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x5580fc7dd460, mutex=0x5580fc7dd400) at pthread_cond_wait.c:655
#3  0x00005580f5f7ceec in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
#4  0x00005580f56311bb in mongo::ThreadPool::waitForIdle() ()
#5  0x00005580f4816d91 in mongo::repl::SyncTail::multiApply(mongo::OperationContext*, std::vector<mongo::repl::OplogEntry, std::allocator<mongo::repl::OplogEntry> >) ()
#6  0x00005580f48186e3 in mongo::repl::SyncTail::_oplogApplication(mongo::repl::OplogBuffer*, mongo::repl::ReplicationCoordinator*, mongo::repl::SyncTail::OpQueueBatcher*) ()
#7  0x00005580f48198c3 in mongo::repl::SyncTail::oplogApplication(mongo::repl::OplogBuffer*, mongo::repl::ReplicationCoordinator*) ()

So I suspect there is a bug here, but I have not been able to find its root cause.

 
