[SERVER-30628] Timeouts between 'mongos' and replica set nodes rejecting requests Created: 12/Aug/17 Updated: 07/Jan/18 Resolved: 08/Dec/17
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Networking, Sharding |
| Affects Version/s: | 3.4.0, 3.4.3, 3.4.4, 3.4.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Lucas | Assignee: | Mark Agarunov |
| Resolution: | Incomplete | Votes: | 5 |
| Labels: | None |
| Operating System: | ALL |
| Description |
We had several issues in the past few days with our applications, and after further analysis we found out that they are all related to our running instances of sharded MongoDB. During this examination we noticed that our servers were reporting a lot of connection timeouts. After analyzing our network we discovered that it didn't have any performance issues; the test results are provided with this ticket in the 'reportNetwork_mongo-20170811.txt' file.

We also found reports on the web of people facing problems similar to ours, for example: https://jira.mongodb.org/browse/SERVER-24711. Even so, we think we are facing an undocumented problem here (maybe related to those others).

A description of the problem: currently we have three different clusters running sharded MongoDB. One of them operates at SoftLayer's (IBM) data center, and the other two at OVH. We're going to focus the data provided in this ticket on the IBM cluster, despite also having problems on the other two, simply because we're already providing a lot of data; if needed we can provide data from the other two clusters as well. The cluster is composed of (all logs will be relative to this cluster):

We can say these issues aren't related to infrastructure problems at the data center, as all of our clusters are having these issues regardless of the infrastructure provider. It is also important to report that cluster size is not a factor in this problem, because we have a tiny cluster of only two servers and they are facing the same problems.

Our DBA also tried to fine-tune some configurations and had no luck fixing the problem. The tuned configurations were:

We also tried to change the max connection pool size, but with no success. Our development team tried different versions of the Java MongoDB driver (currently 3.4.2), also with no positive results. We also tried upgrading and downgrading the MongoDB binary versions (3.4.6, 3.4.3, 3.4.0); none of them solved our problem. We have several non-sharded replica-set deployments, and none of them have faced this issue in the past three years.

The data provided with this ticket is: IPs and DNS names have been hidden or modified for security reasons.
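For reference, the driver-side tuning mentioned above (connection pool size and timeouts) goes through the Java driver's MongoClientOptions. The sketch below only illustrates the kind of settings we experimented with; the hostnames and values are placeholders, not our production configuration:

    import com.mongodb.MongoClient;
    import com.mongodb.MongoClientOptions;
    import com.mongodb.ServerAddress;
    import java.util.Arrays;

    public class ClientTuningSketch {
        public static void main(String[] args) {
            MongoClientOptions options = MongoClientOptions.builder()
                .connectionsPerHost(100)        // max pool size per host (placeholder value)
                .minConnectionsPerHost(10)      // keep some connections warm
                .maxWaitTime(120_000)           // ms to wait for a pooled connection
                .connectTimeout(10_000)         // ms for the TCP connect itself
                .serverSelectionTimeout(30_000) // ms to find an eligible mongos
                .build();

            // Placeholder mongos addresses, not our real hostnames.
            MongoClient client = new MongoClient(Arrays.asList(
                    new ServerAddress("mongos1.example", 27017),
                    new ServerAddress("mongos2.example", 27017)),
                options);
        }
    }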
| Comments |
| Comment by Mark Agarunov [ 08/Dec/17 ] |

Hello lucasoares, we haven't heard back from you for some time, so I'm going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket. Thanks,
| Comment by Mark Agarunov [ 01/Dec/17 ] |

Hello lucasoares, is this still an issue for you, or did the upgrade fix the behavior you were seeing? Thanks,
| Comment by Lucas [ 10/Nov/17 ] |

mark.agarunov, important: we tried versions 3.4.0 and 3.4.3 too, so I really don't think this will completely solve the problem. But I will give it a try.
| Comment by Lucas [ 10/Nov/17 ] |

Hello mark.agarunov. I will upgrade both clusters this weekend. A question: even without SSL, I still see connections taking more than 500ms (sometimes more than 4 seconds). What should we do about those cases? They are rare, but they still occur. Examples:
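Separately from the examples above: one way we could quantify these slow connections from the application side is the driver's server-monitor events. This is only a sketch, assuming the 3.4.x Java driver's listener API; the hostname is a placeholder:

    import com.mongodb.MongoClient;
    import com.mongodb.MongoClientOptions;
    import com.mongodb.ServerAddress;
    import com.mongodb.event.ServerHeartbeatSucceededEvent;
    import com.mongodb.event.ServerMonitorListenerAdapter;
    import java.util.concurrent.TimeUnit;

    public class HeartbeatTimer {
        public static void main(String[] args) {
            MongoClientOptions options = MongoClientOptions.builder()
                .addServerMonitorListener(new ServerMonitorListenerAdapter() {
                    @Override
                    public void serverHeartbeatSucceeded(ServerHeartbeatSucceededEvent event) {
                        long ms = event.getElapsedTime(TimeUnit.MILLISECONDS);
                        // Flag the slow round trips asked about above.
                        if (ms > 500) {
                            System.out.println(event.getConnectionId()
                                + " heartbeat took " + ms + " ms");
                        }
                    }
                })
                .build();
            // Placeholder address, not one of our real hostnames.
            MongoClient client = new MongoClient(
                new ServerAddress("mongos1.example", 27017), options);
        }
    }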
| Comment by Mark Agarunov [ 10/Nov/17 ] |

Hello lucasoares, thank you for providing this additional data. After further investigation, I suspect this behavior may be due to a bug related to SSL, as detailed in the linked ticket. Thanks,
| Comment by Lucas [ 09/Nov/17 ] |

I saw those connections taking a long time to connect in the cluster mentioned in the description (still using preferSSL mode):

I executed the command you asked for on 'mongo-zeroshard4a' and uploaded the file 'mongo-zeroshard4a-gdb.txt'. I also executed the command on one mongod from the cluster I mentioned in the comments (using allowSSL mode): I saw a connection taking 50 seconds to complete in the mongos log, and I executed the command on the target 'mongod'. I uploaded the file called 'mongo-shard-ibm8a-gdb.txt'. On both mongod nodes the command showed this error while executing:
| Comment by Lucas [ 09/Nov/17 ] |

Hello mark.agarunov. Since this cluster is a production one, we did a lot of things to find a workaround. One of those workarounds was disabling preferSSL and changing the whole cluster to allowSSL mode; then we changed our applications to connect to MongoDB without SSL. We did this two days ago, and we saw a huge decrease in connection time. Will the command you asked us to run work even without preferSSL mode? And I'm confused now: you asked to run the command on a 'mongod', right? What I can do is find the shard that is taking the longest to connect (looking at the mongos log) and execute the command there. Will this work? Another thing I can do is execute this command on the other cluster (the one I mentioned in the description, not in the comments), and maybe we will get more information about the issue. In that cluster, connections are taking a little bit longer to connect. For documentation, what I did so far searching for a workaround: 1. Deployed one mongos per application (this started to work, but it isn't good, since we have more than 100 running applications). Sometimes we still see some connections taking 600ms~5s to connect, but in most cases connections only take 5~20ms, like this:

Those times when the connection needs 5 seconds to establish worry me a lot. I'm trying to capture a tcpdump to get some insight into it. We previously had 20 mongos, and the applications always connect to all of them (using all addresses in the Java driver's connection string). We expected the driver's balancing to do the work, but that's how it all started: it appeared that only a few of them were receiving operations.
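A note on the balancing point above: as far as I understand, the Java driver only spreads operations across the mongos instances whose ping latency falls within a small window (localThreshold, 15 ms by default), which could explain why only a few of the 20 mongos received operations. A hypothetical sketch of widening that window; the addresses and the 50 ms value are placeholders:

    import com.mongodb.MongoClient;
    import com.mongodb.MongoClientOptions;
    import com.mongodb.ServerAddress;
    import java.util.Arrays;
    import java.util.List;

    public class MongosBalanceSketch {
        public static void main(String[] args) {
            // Placeholder addresses; in our setup there were ~20 mongos instances.
            List<ServerAddress> mongoses = Arrays.asList(
                new ServerAddress("mongos1.example", 27017),
                new ServerAddress("mongos2.example", 27017),
                new ServerAddress("mongos3.example", 27017));

            MongoClientOptions options = MongoClientOptions.builder()
                // Only mongos whose ping is within localThreshold ms of the
                // fastest one receive operations; the 15 ms default can funnel
                // all load onto a few "closest" mongos.
                .localThreshold(50)
                .build();

            MongoClient client = new MongoClient(mongoses, options);
        }
    }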
| Comment by Mark Agarunov [ 09/Nov/17 ] |

Hello lucasoares, thank you for the additional information. After looking over this some more, I suspect this may be due to an issue when using requireSSL or preferSSL mode. To confirm that this is the cause, could you please execute the following on an affected mongod node when you are seeing this issue and provide the gdb.txt file that is generated?

Thanks,
| Comment by Lucas [ 31/Oct/17 ] |

Hello again! I set a couple of parameters and now the 'mongos' aren't exceeding the timeout to connect to shards:

Now I can see how long connections are taking to establish (around 10~40 seconds). I don't know why, since all my servers have a 10Gbps network and an average ping of 0.02ms. Manually connecting to the shards is really fast. My mongos have an average of 2k connections, and my shards an average of 4k. I will keep diagnosing. Thanks!
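To separate raw network latency from the MongoDB connection handshake, a plain socket timing test can help; this is just a sketch with a placeholder hostname:

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class TcpConnectTimer {
        public static void main(String[] args) throws Exception {
            // Placeholder host/port; point this at a shard member.
            String host = "shard1.example";
            int port = 27017;

            long start = System.nanoTime();
            Socket socket = new Socket();
            // Raw TCP three-way handshake only, no MongoDB protocol involved.
            socket.connect(new InetSocketAddress(host, port), 10_000);
            socket.close();
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            // On a 10Gbps LAN with 0.02 ms ping this should be ~0-1 ms; if the
            // full driver connection still takes tens of seconds, the time is
            // going into the handshake/auth/SSL layers, not the network itself.
            System.out.println("TCP connect to " + host + " took " + elapsedMs + " ms");
        }
    }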
| Comment by Lucas [ 30/Oct/17 ] |

Using glibc 2.23 didn't solve the problem.
| Comment by Lucas [ 30/Oct/17 ] |

Hello. I'm investigating older issues in the same component and I found interesting things. First of all, I'm using RHEL 7.3 (kernel 3.10 and glibc 2.17) on my mongo servers. I think those older issues may be related to this one.
| Comment by Lucas [ 25/Oct/17 ] |

OK. Please ask for anything you need to debug this. I'm here to help, because I really want this problem fixed, or at least a good explanation that helps me solve it. Thanks.
| Comment by Mark Agarunov [ 25/Oct/17 ] |

Hello lucasoares, thank you for providing the additional information. Unfortunately we have not yet determined the cause of this behavior and are still investigating. However, the log you provided should give some additional insight into this. Thanks,
| Comment by Lucas [ 25/Oct/17 ] |

I uploaded a log file with verbose level 5 of one mongos with this same issue. I hope this helps. The file name is 'logging_verbose_5.7z'.
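For anyone reproducing this: verbosity can be raised at runtime with the server's setParameter command (mongos can also simply be started with -vvvvv). A sketch of doing it through the Java driver; the hostname is a placeholder:

    import com.mongodb.MongoClient;
    import com.mongodb.ServerAddress;
    import org.bson.Document;

    public class RaiseLogLevel {
        public static void main(String[] args) {
            // Placeholder mongos address.
            MongoClient client = new MongoClient(new ServerAddress("mongos1.example", 27017));
            try {
                // Equivalent of db.adminCommand({setParameter: 1, logLevel: 5}).
                Document result = client.getDatabase("admin")
                    .runCommand(new Document("setParameter", 1).append("logLevel", 5));
                System.out.println(result.toJson());
            } finally {
                client.close();
            }
        }
    }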
| Comment by Lucas [ 24/Oct/17 ] |

Hello mark.agarunov. I'm still having this issue, but now with a brand new cluster in a different datacenter with a different server provider. My whole team believes the problem is with MongoDB, because we could not find a solution or a functional workaround. We tried adding a lot of 'mongos' instances, with some of them being used by a single application of ours. This minimized the problem, but it is not a good solution for us, since we have more than 300 application servers running. We tried using the same NTP server on all our servers, as advised by a MongoDB employee. We changed lots of infrastructure configuration on our servers following MongoDB's production notes. We tried a lot of things, and nothing is helping. Maybe the problem is due to an infrastructure configuration, but we even tried different OS versions (Ubuntu 14.04 and 16.04). Unfortunately I'm not having any success proving that this isn't a MongoDB issue. I really need help, at least with what kind of parameter or configuration I can try to change, or at least an update on this issue. Thanks!
| Comment by Mark Agarunov [ 22/Sep/17 ] |

Hello lucasoares, thank you for providing these files. My apologies for the delay in response; unfortunately we have not yet determined the cause of this behavior, but we are still investigating this issue. We will provide updates on this as they're available. Thanks,
| Comment by Lucas [ 14/Aug/17 ] |

Thanks ramon.fernandez. We uploaded the files.
| Comment by Ramon Fernandez Marina [ 14/Aug/17 ] |

lucasoares, I've created an upload portal so you can send logs and diagnostic data; please let us know when the uploads are complete. Thanks,
| Comment by Lucas [ 12/Aug/17 ] |

Unfortunately there is a limit of 150MB for attachments. We have tons of logs and diagnostic.data files from the past 3 days to upload. Can you guys open a private portal to upload these files? Thanks.