[SERVER-26723] Mongos stalls for even simple queries Created: 21/Oct/16 Updated: 01/Feb/18 Resolved: 28/Nov/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.2.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Darshan Shah | Assignee: | Kaloian Manassiev |
| Resolution: | Done | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL | ||||||||||||||||||||
| Participants: | |||||||||||||||||||||
| Case: | (copied to CRM) | ||||||||||||||||||||
| Description |
|
We have a 32-node cluster where each node is a 3-member replica set with a hidden member. Operations like count(), show dbs, or show collections, which most probably access only the metadata, work fine on the mongos of the affected machine. However, running any simple query stalls the mongos. Attaching the logs for a mongos process that is not responding, as well as the config server primary node log. |
| Comments |
| Comment by Darshan Shah [ 01/Feb/18 ] | ||||||||||||||||||||||||||||||||||
|
Just FYI - the original problem was resolved when we upgraded MongoDB to 3.2.11 and also upgraded glibc to 2.17. However, we are now noticing problems similar to SERVER-26722 and SERVER-29206, even after upgrading to MongoDB 3.2.18 running on RHEL 7.1. | ||||||||||||||||||||||||||||||||||
| Comment by Vick Mena (Inactive) [ 01/Dec/16 ] | ||||||||||||||||||||||||||||||||||
|
To Summarize:
Recommendations:
As a reminder, RHEL 7.3 has kernel 3.10 and glibc 2.17 | ||||||||||||||||||||||||||||||||||
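A quick way to see which glibc a host is running is sketched below in Python. This is only an illustration: the helper names `parse_version` and `glibc_at_least` are hypothetical, and the 2.17 threshold in the usage note is an assumption taken from the versions mentioned in this ticket, not a confirmed statement of which glibc release contains the fix.

```python
import platform


def parse_version(text):
    """Turn a dotted version string such as '2.17' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_at_least(minimum, detected=None):
    """Return True/False for a glibc version check, or None if glibc is not detected.

    `detected` can be passed explicitly (e.g. a version read from another host);
    otherwise the version linked into the running Python interpreter is used.
    """
    if detected is None:
        lib, detected = platform.libc_ver()
        if lib != "glibc":
            return None  # not a glibc-based system, or detection failed
    return parse_version(detected) >= parse_version(minimum)
```

For example, `glibc_at_least("2.17")` checks the local host against the glibc version mentioned above. Versions are compared numerically, so "2.5" sorts below "2.17".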
| Comment by Roy Reznik [ 29/Nov/16 ] | ||||||||||||||||||||||||||||||||||
|
The bug that you linked to is still in the state "NEW" meaning that there is no operating system that contains a fix for that. | ||||||||||||||||||||||||||||||||||
| Comment by Kelsey Schubert [ 28/Nov/16 ] | ||||||||||||||||||||||||||||||||||
|
Hi all, We have completed our investigation and concluded that the issue described in this ticket is caused by a bug in glibc. Therefore, we will be resolving this ticket, as there is no work to be done on the MongoDB server to correct this behavior. To ensure that you are not affected by this issue, please upgrade your operating system to a version that contains the fix for https://bugzilla.kernel.org/show_bug.cgi?id=99671 Please be aware that this issue is distinct from Thank you, | ||||||||||||||||||||||||||||||||||
| Comment by Roy Reznik [ 27/Nov/16 ] | ||||||||||||||||||||||||||||||||||
|
Any news on this? | ||||||||||||||||||||||||||||||||||
| Comment by Kelsey Schubert [ 22/Nov/16 ] | ||||||||||||||||||||||||||||||||||
|
Hi alessandro.gherardi@yahoo.com, Thanks for uploading the logs. We are still investigating this issue and will update this ticket when we know more. Kind regards, | ||||||||||||||||||||||||||||||||||
| Comment by Alessandro Gherardi [ 16/Nov/16 ] | ||||||||||||||||||||||||||||||||||
|
@Kaloian Manassiev - I uploaded log files from our server to https://jira.mongodb.org/browse/SERVER-26654 . I understand that ticket is now closed - feel free to move those log files here. | ||||||||||||||||||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 16/Nov/16 ] | ||||||||||||||||||||||||||||||||||
|
jon@appboy.com, unfortunately we have not yet been able to confirm the root cause for this issue, so it can't be part of 3.2.11. As kaloian.manassiev mentioned above we're trying to determine if the problem is related to the glibc bug listed above. Thanks, | ||||||||||||||||||||||||||||||||||
| Comment by Jon Hyman [ 16/Nov/16 ] | ||||||||||||||||||||||||||||||||||
|
I noticed that the 3.2.11-rc1 note came out today - is this going to be targeted for 3.2.11? We are unable to upgrade because of this issue. | ||||||||||||||||||||||||||||||||||
| Comment by Darshan Shah [ 14/Nov/16 ] | ||||||||||||||||||||||||||||||||||
|
[oasprod@bxb-ppe-oas002 tmp]$ uname -a | ||||||||||||||||||||||||||||||||||
| Comment by Kaloian Manassiev [ 11/Nov/16 ] | ||||||||||||||||||||||||||||||||||
|
Thanks a lot, darshan.shah@interactivedata.com for providing the complete thread stack traces. Based on this information I see that all but one of the asio host name resolution threads are stuck in the hostname resolution call:
and one thread has not returned from a recvmsg call:
This seems to be similar to a deadlock bug reported for glibc and we are now working on confirming that:
Can you please run "uname -a" on one of the mongos hosts exhibiting the problem and paste the output in this ticket so we can see which kernel you are running? Thanks in advance. -Kal. | ||||||||||||||||||||||||||||||||||
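To check from outside mongos whether hostname resolution on a host is hanging, a probe along the following lines could be used. It is a minimal Python sketch, not part of MongoDB: the function name `probe_resolution`, its parameters, and the status strings are hypothetical. It runs the lookup in a daemon thread so that a lookup that never returns cannot block the caller.

```python
import socket
import threading
import time


def probe_resolution(host, port, timeout=5.0, resolver=socket.getaddrinfo):
    """Run a hostname lookup in a daemon thread and report how it ended.

    Returns (status, elapsed_seconds) where status is "ok", "error", or
    "timeout". A "timeout" means the lookup had not returned within
    `timeout` seconds, which is the symptom described in this ticket.
    """
    result = {}

    def worker():
        try:
            result["addrs"] = resolver(host, port)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            result["error"] = exc

    start = time.monotonic()
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    elapsed = time.monotonic() - start

    if thread.is_alive():
        return "timeout", elapsed
    if "error" in result:
        return "error", elapsed
    return "ok", elapsed
```

Calling `probe_resolution("chi-ppe-oas009", 29111)` on an affected host would exercise the same kind of lookup for the shard primary mentioned in this ticket. The `resolver` parameter exists only so the probe can be tested without network access.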
| Comment by Darshan Shah [ 04/Nov/16 ] | ||||||||||||||||||||||||||||||||||
|
Unfortunately, I do not have the full thread dump from that particular date. Let me know if you need any other info. Thanks. | ||||||||||||||||||||||||||||||||||
| Comment by Kaloian Manassiev [ 01/Nov/16 ] | ||||||||||||||||||||||||||||||||||
|
Thank you darshan.shah@interactivedata.com for providing us with the call stacks - this is really helpful. The two stacks show that there are asynchronous networking threads which are stuck doing hostname resolution and are blocked in the OS. However from them I am unable to see whether all the client threads are also blocked on this hostname resolution or on something else. Would it be possible to attach all the threads' stacks if you still have them? Alternatively you can use this command, which will output them to a file:
Thank you in advance. -Kal. | ||||||||||||||||||||||||||||||||||
| Comment by Darshan Shah [ 26/Oct/16 ] | ||||||||||||||||||||||||||||||||||
|
Here is the server info:
We observed that the number of threads in a stuck mongos process is much higher than in one that is working fine.
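One way to make this thread-count comparison repeatable is sketched below in Python. It assumes a Linux host, where each thread of a process appears as a numeric directory under `/proc/<pid>/task`. The helper names and the `proc_root` parameter are hypothetical and exist only for illustration.

```python
import os


def count_threads(pid, proc_root="/proc"):
    """Count the threads of a process by listing <proc_root>/<pid>/task."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    return sum(1 for name in os.listdir(task_dir) if name.isdigit())


def compare_thread_counts(pids, proc_root="/proc"):
    """Map each PID to its thread count, e.g. a stalled mongos vs. a healthy one."""
    return {pid: count_threads(pid, proc_root) for pid in pids}
```

Passing the PIDs of a stalled and a healthy mongos to `compare_thread_counts` gives the two counts side by side.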
| ||||||||||||||||||||||||||||||||||
| Comment by Darshan Shah [ 21/Oct/16 ] | ||||||||||||||||||||||||||||||||||
|
After restarting the problematic mongos, it works fine. In the stalled mongos logs, I see it tries to connect to chi-ppe-oas009 on port 29111 which is the primary mongod for that replicaset. So attaching the full mongod log for chi-ppe-oas009 and the log for the stalled mongos (on bxb-ppe-oas010) after the restart. | ||||||||||||||||||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 21/Oct/16 ] | ||||||||||||||||||||||||||||||||||
|
Thanks for the report, we'll take a look at the logs for clues. |