[SERVER-26723] Mongos stalls for even simple queries Created: 21/Oct/16  Updated: 01/Feb/18  Resolved: 28/Nov/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Darshan Shah Assignee: Kaloian Manassiev
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File bxb-ppe-oas006-mongos.stacks     File mongod.log-20161021-chi-ppe-oas009.gz     File mongodb_configdb-20161021-bxb-ppe-oas002.log.gz     File mongos.log-20161021-bxb-ppe-oas010-after_restart.gz     File mongos.log-20161021-bxb-ppe-oas010.gz     Text File mongos_bxb-ppe-oas006.log    
Issue Links:
Duplicate
is duplicated by SERVER-26654 ExceededTimeLimit: Operation timed ou... Closed
is duplicated by SERVER-26701 MongoS stalls when it cannot access o... Closed
Related
related to SERVER-26722 router blocks and throws ExceededTime... Closed
Operating System: ALL
Participants:
Case:

 Description   

We have a 32-node cluster where each node is a 3-member replica set that includes a hidden member.
The cluster is running MongoDB 3.2.5 with WiredTiger on RHEL 7.
We see that some mongos processes get stuck and do not return even for a very simple query, while most other mongos processes are unaffected.
This happens on different nodes and is not limited to specific machines: it happened yesterday on another machine, and after restarting the cluster it happened on this one.

Queries like count(), show dbs, or show collections, which most probably access only the metadata, work fine on the mongos of the affected machine. However, running any simple query that fetches documents stalls that mongos.
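
For illustration only, here is a minimal sketch of how the two behaviours can be distinguished from the shell; the mongos port, database, and collection names are placeholders, not values from this cluster:

$> mongo bxb-ppe-oas010:27017/mydb --eval 'db.adminCommand("listDatabases")'    # metadata-only command, returns
$> mongo bxb-ppe-oas010:27017/mydb --eval 'db.mycoll.count()'                   # returns
$> mongo bxb-ppe-oas010:27017/mydb --eval 'db.mycoll.findOne()'                 # fetches documents from the shards, never returns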

Attaching the mongos log for a mongos process that is not responding, as well as the config server primary node log.
Note that the mongo processes were started with the -v flag to increase logging verbosity.



 Comments   
Comment by Darshan Shah [ 01/Feb/18 ]

Just FYI - the original problem was resolved when we upgraded MongoDB to 3.2.11 and also upgraded glibc to 2.17.

However, we are now noticing problems similar to SERVER-26722 and SERVER-29206, even after upgrading to MongoDB 3.2.18 running on RHEL 7.1.

Comment by Vick Mena (Inactive) [ 01/Dec/16 ]

To Summarize:

  • Users running glibc versions < 2.21 are susceptible to the infinite-loop issue
  • Users running glibc versions < 2.23 are susceptible to the bugs in kernels 3.17-3.19

Recommendations:

  • Upgrade glibc to a minimum of 2.21
  • Users running the affected kernels should upgrade to the latest v4.0 kernel for their vendor and optionally upgrade glibc to 2.23

As a reminder, RHEL 7.3 ships kernel 3.10 and glibc 2.17.
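
As a quick reference, the kernel and glibc versions on a RHEL host can be checked with standard tooling, for example:

$> uname -r                  # kernel version (the affected kernels are 3.17-3.19)
$> rpm -q glibc              # installed glibc package version
$> ldd --version | head -1   # glibc version as reported by the runtime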

Comment by Roy Reznik [ 29/Nov/16 ]

The bug that you linked to is still in state "NEW", meaning that no operating system release contains a fix for it yet.
Can you specify which kernel versions are safe to use?

Comment by Kelsey Schubert [ 28/Nov/16 ]

Hi all,

We have completed our investigation and concluded that the issue described in this ticket is caused by a bug in glibc. Therefore, we will be resolving this ticket, as there is no work to be done on the MongoDB server to correct this behavior.

To ensure that you are not affected by this issue, please upgrade your operating system to a version that contains the fix for https://bugzilla.kernel.org/show_bug.cgi?id=99671

Please be aware that this issue is distinct from SERVER-26722, which is not caused by a bug in glibc. We are continuing to investigate this issue, and will update SERVER-26722 when we know more.

Thank you,
Thomas

Comment by Roy Reznik [ 27/Nov/16 ]

Any news on this?
It's a ticket causing a complete mongos meltdown in sharded environments, and logs were provided a while ago already.

Comment by Kelsey Schubert [ 22/Nov/16 ]

Hi alessandro.gherardi@yahoo.com,

Thanks for uploading the logs. We are still investigating this issue and will update this ticket when we know more.

Kind regards,
Thomas

Comment by Alessandro Gherardi [ 16/Nov/16 ]

@Kaloian Manassiev - I uploaded log files from our server to https://jira.mongodb.org/browse/SERVER-26654 . I understand that ticket is now closed - feel free to move those log files here.

Comment by Ramon Fernandez Marina [ 16/Nov/16 ]

jon@appboy.com, unfortunately we have not yet been able to confirm the root cause for this issue, so it can't be part of 3.2.11.

As kaloian.manassiev mentioned above, we're trying to determine whether the problem is related to the glibc bug listed above.

Thanks,
Ramón.

Comment by Jon Hyman [ 16/Nov/16 ]

I noticed that the 3.2.11-rc1 note came out today - is this going to be targeted for 3.2.11? We are unable to upgrade because of this issue / SERVER-26654, so we would appreciate it if a fix could make it into 3.2.11.

Comment by Darshan Shah [ 14/Nov/16 ]

[oasprod@bxb-ppe-oas002 tmp]$ uname -a
Linux bxb-ppe-oas002 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux

Comment by Kaloian Manassiev [ 11/Nov/16 ]

Thanks a lot, darshan.shah@interactivedata.com, for providing the complete thread stack traces. Based on this information, I see that all but one of the asio hostname resolution threads are stuck in the hostname resolution call:

#0  0x00007f4f676fdb6c in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007f4f67713efd in _L_lock_746 () from /lib64/libc.so.6
#2  0x00007f4f67713cb5 in __check_pf () from /lib64/libc.so.6
#3  0x00007f4f676d9d31 in getaddrinfo () from /lib64/libc.so.6

and one thread has not returned from a recvmsg call:

#0  0x00007f4f676f111d in recvmsg () from /lib64/libc.so.6
#1  0x00007f4f6771390d in make_request () from /lib64/libc.so.6
#2  0x00007f4f67713e04 in __check_pf () from /lib64/libc.so.6
#3  0x00007f4f676d9d31 in getaddrinfo () from /lib64/libc.so.6
#4  0x0000000000c8a9b7 in asio::detail::socket_ops::getaddrinfo(char const*, char const*, addrinfo const&, addrinfo**, std::error_code&) ()

This seems to be similar to a deadlock bug reported against glibc, and we are now working on confirming that.

Can you please run "uname -a" on one of the mongos hosts exhibiting the problem and paste the output in this ticket, so we can see which kernel you are running?

Thanks in advance.

-Kal.

Comment by Darshan Shah [ 04/Nov/16 ]

Unfortunately, I do not have the full thread dump from that particular date.
I am attaching the full thread dump and the mongos log file from today, taken from another server where I see the same problem.
Note that network connectivity between the problem server and the other servers is good - I can ping and ssh in both directions.

Let me know if you need any other info.

Thanks.

Comment by Kaloian Manassiev [ 01/Nov/16 ]

Thank you darshan.shah@interactivedata.com for providing us with the call stacks - this is really helpful.

The two stacks show that there are asynchronous networking threads which are stuck doing hostname resolution and are blocked in the OS. However, from them alone I am unable to see whether all the client threads are also blocked on this hostname resolution or on something else.

Would it be possible to attach all the threads' stacks if you still have them?

Alternatively, you can use this command, which will output them to a file:

gdb -ex "thread apply all bt" -p $(pidof mongos) > mongos.stacks

Thank you in advance.

-Kal.

Comment by Darshan Shah [ 26/Oct/16 ]

Here is the server info:

$> uname -a
Linux bxb-ppe-oas013 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
$> cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.1 (Maipo)

We observed that the number of threads in a stuck mongos process is much higher than in one that's working fine.
Here is the stack trace from gdb on a stuck mongos process:

Thread 314 (Thread 0x7f2053ead700 (LWP 12953)):
#0  0x00007f20872dc11d in recvmsg () from /lib64/libc.so.6
#1  0x00007f20872fe90d in make_request () from /lib64/libc.so.6
#2  0x00007f20872fee04 in __check_pf () from /lib64/libc.so.6
#3  0x00007f20872c4d31 in getaddrinfo () from /lib64/libc.so.6
#4  0x0000000000c8a9b7 in asio::detail::socket_ops::getaddrinfo(char const*, char const*, addrinfo const&, addrinfo**, std::error_code&) ()
#5  0x0000000000c8aae3 in asio::detail::socket_ops::background_getaddrinfo(std::weak_ptr<void> const&, char const*, char const*, addrinfo const&, addrinfo**, std::error_code&) ()
#6  0x0000000000a221c1 in asio::detail::resolve_op<asio::ip::tcp, asio::detail::wrapped_handler<asio::io_service::strand, mongo::executor::NetworkInterfaceASIO::_connect(mongo::executor::NetworkInterfaceASIO::AsyncOp*)::{lambda(std::error_code, asio::ip::basic_resolver_iterator<asio::ip::tcp>)#1}, asio::detail::is_continuation_if_running> >::do_complete(void*, asio::detail::scheduler_operation*, std::error_code const&, unsigned long) ()
#7  0x0000000000c8c3e1 in asio::detail::scheduler::do_run_one(asio::detail::scoped_lock<asio::detail::posix_mutex>&, asio::detail::scheduler_thread_info&, std::error_code const&) ()
#8  0x0000000000c8c601 in asio::detail::scheduler::run(std::error_code&) ()
#9  0x0000000000c9079f in asio::io_service::run() ()
#10 0x0000000000c83f40 in asio_detail_posix_thread_function ()
#11 0x00007f20875addf5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f20872db1ad in clone () from /lib64/libc.so.6
 
Thread 313 (Thread 0x7f20536ac700 (LWP 12957)):
#0  0x00007f20872e8b6c in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007f20872feefd in _L_lock_746 () from /lib64/libc.so.6
#2  0x00007f20872fecb5 in __check_pf () from /lib64/libc.so.6
#3  0x00007f20872c4d31 in getaddrinfo () from /lib64/libc.so.6
#4  0x0000000000c8a9b7 in asio::detail::socket_ops::getaddrinfo(char const*, char const*, addrinfo const&, addrinfo**, std::error_code&) ()
#5  0x0000000000c8aae3 in asio::detail::socket_ops::background_getaddrinfo(std::weak_ptr<void> const&, char const*, char const*, addrinfo const&, addrinfo**, std::error_code&) ()
#6  0x0000000000a221c1 in asio::detail::resolve_op<asio::ip::tcp, asio::detail::wrapped_handler<asio::io_service::strand, mongo::executor::NetworkInterfaceASIO::_connect(mongo::executor::NetworkInterfaceASIO::AsyncOp*)::{lambda(std::error_code, asio::ip::basic_resolver_iterator<asio::ip::tcp>)#1}, asio::detail::is_continuation_if_running> >::do_complete(void*, asio::detail::scheduler_operation*, std::error_code const&, unsigned long) ()
#7  0x0000000000c8c3e1 in asio::detail::scheduler::do_run_one(asio::detail::scoped_lock<asio::detail::posix_mutex>&, asio::detail::scheduler_thread_info&, std::error_code const&) ()
#8  0x0000000000c8c601 in asio::detail::scheduler::run(std::error_code&) ()
#9  0x0000000000c9079f in asio::io_service::run() ()
#10 0x0000000000c83f40 in asio_detail_posix_thread_function ()
#11 0x00007f20875addf5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f20872db1ad in clone () from /lib64/libc.so.6

Comment by Darshan Shah [ 21/Oct/16 ]

After restarting the problematic mongos, it works fine.

In the stalled mongos logs, I see that it tries to connect to chi-ppe-oas009 on port 29111, which is the primary mongod for that replica set.

So I am attaching the full mongod log for chi-ppe-oas009 and the log for the stalled mongos (on bxb-ppe-oas010) after the restart.

Comment by Ramon Fernandez Marina [ 21/Oct/16 ]

Thanks for the report, we'll take a look at the logs for clues.
