[SERVER-30628] Timeouts between 'mongos' and replicaset nodes rejecting requests Created: 12/Aug/17  Updated: 07/Jan/18  Resolved: 08/Dec/17

Status: Closed
Project: Core Server
Component/s: Networking, Sharding
Affects Version/s: 3.4.0, 3.4.3, 3.4.4, 3.4.6
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Lucas Assignee: Mark Agarunov
Resolution: Incomplete Votes: 5
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: connections.png (PNG), ftdc.png (PNG), mongo-shard-ibm8a-gdb.txt (text), mongo-zeroshard4a-gdb.txt (text), reportNetwork_mongo-20170811.txt (text)
Issue Links:
Related
related to SERVER-30643 Performance regression with SSL Closed
Operating System: ALL
Participants:

 Description   

We had several issues with our applications in the past few days, and after further analysis we found that they are all related to our sharded MongoDB deployments. During this examination we noticed that our servers were reporting a lot of connection timeouts. After analyzing our network we found no performance issues there; the test results are provided with this ticket in the 'reportNetwork_mongo-20170811.txt' file.

We also found reports on the web from people facing problems similar to ours; these are the links we found:

https://jira.mongodb.org/browse/SERVER-24711
https://stackoverflow.com/questions/38485285/mongodb-replica-heartbeat-request-time-exceeded
https://jira.mongodb.org/browse/SERVER-24058
https://github.com/rofl0r/proxychains-ng/issues/171

Even so, we think we are facing an undocumented problem here (possibly related to those issues).

A description of the problem:
Some requests made to our sharded clusters fail with a timeout message (found in the database logs) with no apparent reason or pattern that we could identify. If we retry the same request right after the problematic one, it may work without any problems, but sometimes it fails again. We implemented this retry in our application code to mitigate the problem (see the sketch below), but since each attempt has to wait out the full timeout before we can retry, our clients experience slowness, and the retry is not a satisfactory fix for us.
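
For illustration, here is a minimal sketch of that retry (not our actual code; the helper name and the Supplier-based shape are made up for this ticket), using the Java driver exceptions shown further below:

import com.mongodb.MongoExecutionTimeoutException;
import com.mongodb.MongoTimeoutException;
import java.util.function.Supplier;

// Illustrative retry helper: re-issue a driver call a fixed number of times when it
// times out. Each failed attempt has already waited out the full timeout, which is
// why this only mitigates the slowness instead of fixing it.
public final class TimeoutRetry {
    public static <T> T withRetry(Supplier<T> query, int attempts) {
        for (int i = 1; i < attempts; i++) {
            try {
                return query.get();
            } catch (MongoExecutionTimeoutException | MongoTimeoutException e) {
                // swallow and retry; the timeout itself already cost the full wait
            }
        }
        return query.get(); // last attempt: let the exception propagate
    }
}

It would be called like TimeoutRetry.withRetry(() -> collection.find(filter).first(), 3), with collection and filter standing in for the real query.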

Currently we have three different clusters running sharded MongoDB. One of them operates in SoftLayer's (IBM) data center and the other two at OVH. We will focus the data provided in this ticket on the IBM cluster, even though we also have problems on the other two, simply because we are already providing a lot of data; if needed, we can provide data from the other two clusters as well.

The cluster is composed of (all logs will be relative to this cluster):
Servers hosted in IBM's data center.
3 MongoDB (3.4.4) servers, each running in a VM with 32 GB RAM and 8 cores.
3 MongoDB (3.4.4) config servers, each running in a VM with 2 GB RAM and 2 cores.
8 nodes (3.4.4) with a replication factor of 3, on bare-metal machines with 256 GB RAM and 40 cores.
All of them communicate with each other over a 1 Gbps network.

We can say these issues aren't related to infrastructure problems at the data centers, since all of our clusters show the same behavior regardless of the infrastructure provider. It is also important to report that cluster size is not a factor: we have a tiny cluster of only two servers and it faces the same problems.

Our DBA also tried to fine-tune some configurations, with no luck in fixing the problem. The tuned parameters were (a configuration sketch follows the list):
WiredTigerConcurrentReadOperations – from default to 256
WiredTigerConcurrentWriteOperations – from default to 256
ShardingTaskExecutorPoolHostTimeoutMS – from default to 7200000
ShardingTaskExecutorPoolMinSize – from default to 10
taskExecutorPoolSize – from default to 16 (twice the number of cores)
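
For reference, a sketch of how this tuning is typically applied through the configuration files (the values are the ones listed above, not recommendations; the two WiredTiger settings are normally exposed as the server parameters wiredTigerConcurrentReadTransactions / wiredTigerConcurrentWriteTransactions):

# mongod (shard members) configuration file excerpt
setParameter:
  wiredTigerConcurrentReadTransactions: 256
  wiredTigerConcurrentWriteTransactions: 256

# mongos configuration file excerpt
setParameter:
  ShardingTaskExecutorPoolHostTimeoutMS: 7200000
  ShardingTaskExecutorPoolMinSize: 10
  taskExecutorPoolSize: 16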

We also tried changing the maximum connection pool size, with no success.

Our development team also tried different versions of the MongoDB Java driver (currently 3.4.2), also with no positive results. Examples of the errors we see:

com.mongodb.MongoExecutionTimeoutException: Operation timed out, request was RemoteCommand 254187776 -- target:mongo-shard-geral7a.foobar.com:7308 db:admin cmd:{ isMaster: 1 }

com.mongodb.MongoExecutionTimeoutException: Couldn't get a connection within the time limit

We also tried upgrading and downgrading the MongoDB binaries (3.4.6, 3.4.3, 3.4.0); none of them had any effect on the problem.

We have several standalone replica-set deployments, and none of them has faced this issue in the past three years.

The data provided with this ticket is:
All MongoDB Nodes (from the given cluster) logs and diagnostic.data files (past 3 days).
All MongoDB Servers (from the given cluster) logs (past 3 days).
All MongoDB Configs (from the given cluster) logs (past 3 days).
Network tests between a 'mongos', a config server and a primary node.

IPs and DNS names have been hidden or modified for security reasons.



 Comments   
Comment by Mark Agarunov [ 08/Dec/17 ]

Hello lucasoares

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Thanks,
Mark

Comment by Mark Agarunov [ 01/Dec/17 ]

Hello lucasoares,

Is this still an issue for you, or did the upgrade fix the behavior you were seeing?

Thanks,
Mark

Comment by Lucas [ 10/Nov/17 ]

mark.agarunov, one important note: we tried versions 3.4.0 and 3.4.3 too. SERVER-30643 affects only 3.4.4+, correct?

I really don't think this will completely solve the problem, but I will give it a try.

Comment by Lucas [ 10/Nov/17 ]

Hello mark.agarunov.

I will upgrade both clusters this weekend.

A question: even without SSL, I still see connections taking more than 500ms (sometimes more than 4 seconds). What should we do about those cases?

They are rare, but they still occur. Examples:

2017-11-10T18:04:22.768+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Successfully connected to mongo-shard-ibm8a:27017, took 5006ms (16 connections now open to mongo-shard-ibm8a:27017)
2017-11-10T18:04:22.774+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Successfully connected to mongo-shard-ibm8a:27017, took 4941ms (16 connections now open to mongo-shard-ibm8a:27017)
2017-11-10T18:04:22.776+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Successfully connected to mongo-shard-ibm8a:27017, took 4804ms (16 connections now open to mongo-shard-ibm8a:27017)
2017-11-10T18:04:22.777+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Successfully connected to mongo-shard-ibm8a:27017, took 4744ms (16 connections now open to mongo-shard-ibm8a:27017)
2017-11-10T18:10:56.362+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-0-0] Successfully connected to mongo-shard-ibm9a:27017, took 5012ms (11 connections now open to mongo-shard-ibm9a:27017)
2017-11-10T18:10:56.372+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-0-0] Successfully connected to mongo-shard-ibm9a:27017, took 4994ms (11 connections now open to mongo-shard-ibm9a:27017)

2017-11-10T17:12:23.392+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-1-0] Successfully connected to mongo-shard-ibm7c:27017, took 5010ms (8 connections now open to mongo-shard-ibm7c:27017)
2017-11-10T18:29:05.534+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-6-0] Successfully connected to mongo-shard-ibm9c:27017, took 5018ms (6 connections now open to mongo-shard-ibm9c:27017)
2017-11-10T18:31:48.835+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-0-0] Successfully connected to mongo-shard-ibm5c:27017, took 5007ms (12 connections now open to mongo-shard-ibm5c:27017)

2017-11-10T16:33:01.118+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-5-0] Successfully connected to mongo-shard-ibm6b:27017, took 5018ms (10 connections now open to mongo-shard-ibm6b:27017)
2017-11-10T18:10:56.364+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-1-0] Successfully connected to mongo-shard-ibm9a:27017, took 5014ms (9 connections now open to mongo-shard-ibm9a:27017)

Comment by Mark Agarunov [ 10/Nov/17 ]

Hello lucasoares,

Thank you for providing this additional data. After further investigation, I suspect this behavior may be due to a bug related to SSL, as detailed in SERVER-30643, which was fixed in 3.4.9. If possible, please upgrade to 3.4.9 or newer and see if this fixes the issue.

Thanks,
Mark

Comment by Lucas [ 09/Nov/17 ]

I saw connections like these taking a long time to establish in the cluster mentioned in the description (still using preferSSL mode):

2017-11-09T18:31:46.216+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-2-0] Successfully connected to mongo-zeroshard4a:27017, took 3323ms (32 connections now open to mongo-zeroshard4a:27017)
2017-11-09T18:31:46.217+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-2-0] Successfully connected to mongo-zeroshard4a:27017, took 3539ms (32 connections now open to mongo-zeroshard4a:27017)
2017-11-09T18:31:46.260+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-0-0] Successfully connected to mongo-zeroshard4a:27017, took 2883ms (29 connections now open to mongo-zeroshard4a:27017)
2017-11-09T18:33:10.745+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-2-0] Successfully connected to mongo-zeroshard4a:27017, took 938ms (29 connections now open to mongo-zeroshard4a:27017)
2017-11-09T18:33:10.745+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-2-0] Successfully connected to mongo-zeroshard4a:27017, took 2821ms (29 connections now open to mongo-zeroshard4a:27017)
2017-11-09T18:33:10.752+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-2-0] Successfully connected to mongo-zeroshard4a:27017, took 2469ms (29 connections now open to mongo-zeroshard4a:27017)

I executed the command you asked for on 'mongo-zeroshard4a' and uploaded the file 'mongo-zeroshard4a-gdb.txt'.

I also executed the command on one mongod from the cluster I mentioned in the comments (using allowSSL mode): I saw a connection take 50 seconds to complete in the mongos log, so I ran the command on the target 'mongod'. I uploaded the file called 'mongo-shard-ibm8a-gdb.txt'.

On both mongod nodes the command showed this error while executing:

185	../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
185	../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
185	../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
185	../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
185	../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.

Comment by Lucas [ 09/Nov/17 ]

Hello mark.agarunov. Since this cluster is a production one, we did a lot of things to find a workaround. One of those workarounds was disabling preferSSL and switching the whole cluster to allowSSL mode; we then changed our applications to connect to MongoDB without SSL. We did this two days ago, and we saw a huge decrease in connection times.
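
For reference, the cluster-side part of that change is roughly the following setting in each node's configuration file (a sketch; certificate-related options are omitted):

net:
  ssl:
    mode: allowSSL   # previously preferSSL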

Will the command you asked us to run work even without preferSSL mode?

And I'm confused now. You asked us to run the command on a 'mongod', right? What I can do is find the shard that takes the longest to connect (looking at the mongos log) and execute the command there. Will that work?

Another thing I can do is execute the command on the other cluster (the one I mentioned in the description, not in the comments); maybe that will give us more information about the issue. On that cluster connections are taking a little longer to establish.

For documentation, here is what I have done so far searching for a workaround (a driver-side sketch of items 2, 4 and 5 follows the list):

1. Deployed one mongos per application (this started to work, but it isn't a good solution, since we have more than 100 running applications).
2. Removed the SSL protocol from all applications connecting to the mongos.
3. Deployed 15 applications using only 3 mongos addresses to connect. I did this yesterday as an alternative to workaround 1, since we can't manage that many mongos servers. It's working so far.
4. Limited the pool size of the Java driver to 35 connections. We did this to keep the connection pools on the mongos from ballooning: when we saw connection times growing up to 40 seconds, we figured out that the 'mongos' were creating a whole bunch of connections (and killing them when they became idle). This looked like an endless loop, and limiting the pool size may prevent it from happening.
5. Changed the read preference to secondaryPreferred. We know the production notes don't advise secondary reads, but so far this has really improved our connection times. We don't put any aggressive load on the primaries, since they have SSDs, plenty of CPUs and RAM, and a good network. I think pooling connections to the secondaries may keep the 'mongos' from creating lots of new connections and let them reuse the ones already open.
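
For reference, a rough sketch of what items 2, 4 and 5 look like on the driver side with the 3.4 Java driver (the hostnames are placeholders, not our real ones):

import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ReadPreference;
import com.mongodb.ServerAddress;
import java.util.Arrays;

public class MongosClientSketch {
    public static MongoClient build() {
        MongoClientOptions options = MongoClientOptions.builder()
                .sslEnabled(false)                                   // item 2: no SSL from the application
                .connectionsPerHost(35)                              // item 4: cap the driver connection pool
                .readPreference(ReadPreference.secondaryPreferred()) // item 5: prefer secondary reads
                .build();
        return new MongoClient(
                Arrays.asList(new ServerAddress("mongos-1.example.com", 27017),  // placeholder hosts
                              new ServerAddress("mongos-2.example.com", 27017),
                              new ServerAddress("mongos-3.example.com", 27017)),
                options);
    }
}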

Sometimes we still see connections taking 600 ms to 5 s to establish, but in most cases connections take only 5 to 20 ms, like this one:

2017-11-09T17:49:58.314+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-4-0] Successfully connected to mongo-shard-ibm2a:27017, took 9ms (6 connections now open to mongo-shard-ibm2a:27017)

Those cases where a connection needs 5 seconds to establish worry me a lot. I'm trying to capture a tcpdump to get some insight into them.

We previously had 20 mongos, and the applications always connected to all of them (using all the addresses in the Java driver connection settings). We expected the driver's balancing to do the work, but that's how it all started: it appeared that only a few of them were receiving operations.

Comment by Mark Agarunov [ 09/Nov/17 ]

Hello lucasoares,

Thank you for the additional information. After looking over this some more, I suspect this may be due to an issue when using requireSSL or preferSSL mode. To confirm that this is the cause, could you please execute the following on an affected mongod node when you are seeing this issue and provide the gdb.txt file that is generated?

for i in $(seq 5); do gdb -ex "thread apply all bt" --batch --pid $(pidof mongod); date; sleep 10; done > gdb.txt

Thanks,
Mark

Comment by Lucas [ 31/Oct/17 ]

Hello again!

I set a couple of parameters and now the 'mongos' are no longer exceeding the timeout when connecting to the shards:
cursorTimeoutMillis: '1800000'
ShardingTaskExecutorPoolHostTimeoutMS: '300000000'
ShardingTaskExecutorPoolRefreshTimeoutMS: '600000'

Now I can see how long connections are taking to establish (around 10 to 40 seconds).

I don't know why, since all my servers are on a 10Gbps network with an average ping of 0.02ms. Manually connecting to the shards is really fast.

My mongos have an average of 2k connections, and my shards an average of 4k.

I will keep diagnosing. Thanks!

Comment by Lucas [ 30/Oct/17 ]

Using glibc 2.23 didn't solve the problem.

Comment by Lucas [ 30/Oct/17 ]

Hello.

I'm investigating older issues in the same component and found some interesting things.

First of all, I'm using RHEL 7.3 (kernel 3.10 and glibc 2.17) on my mongo servers. The issue SERVER-26723 says, and I quote:

We have completed our investigation and concluded that the issue described in this ticket is caused by a bug in glibc.

The issue SERVER-26723 doesn't have the same error pattern as this one, but I can't ignore it, so I will install 5 new MongoDB servers on an OS with a 4.x kernel and then try to reproduce the same problem.

I think issues SERVER-26654 and SERVER-29206 are more likely to be the same problem I'm having, just with different MongoDB versions. One of the reporters also has a different operating system.

Comment by Lucas [ 25/Oct/17 ]

OK. Please ask for anything you need to debug this. I'm here to help, because I really want this problem fixed or at least a good explanation that helps me solve it.

Thanks.

Comment by Mark Agarunov [ 25/Oct/17 ]

Hello lucasoares,

Thank you for providing the additional information. Unfortunately we have not yet determined the cause of this behavior and are still investigating. However the log you provided should give some additional insight into this.

Thanks,
Mark

Comment by Lucas [ 25/Oct/17 ]

I uploaded a log file with verbose level 5 from one mongos showing this same issue. I hope this helps. The file name is 'logging_verbose_5.7z'.

Comment by Lucas [ 24/Oct/17 ]

Hello mark.agarunov. I'm still having this issue, but now with a brand new cluster in a different datacenter from a different server provider. My whole team believes the problem is in MongoDB, because we could not find a solution or a functional workaround.

We are trying to add a lot of 'mongos' instances, some of which are used by a single application of ours. This minimized the problem, but it is not a good solution for us, since we have more than 300 application servers running.

We tried using the same NTP server on all our servers, as advised by a MongoDB employee.

We tried changing a lot of infrastructure configuration on our servers, following the MongoDB production notes.

We have tried a lot of things, and nothing is helping.

Maybe the problem is due to some infrastructure configuration, but we even tried different OS versions (Ubuntu 14.04 and 16.04). Unfortunately I'm not having any success proving that it isn't a MongoDB issue. I really need help, at least a suggestion of which parameter or configuration I could try to change, or at least an update on this issue.

Thanks!

Comment by Mark Agarunov [ 22/Sep/17 ]

Hello lucasoares,

Thank you for providing these files. My apologies for the delay in response; unfortunately we have not yet determined the cause of this behavior, but we are still investigating this issue. We will provide updates on this as they're available.

Thanks,
Mark

Comment by Lucas [ 14/Aug/17 ]

Thanks ramon.fernandez. We uploaded the files.

Comment by Ramon Fernandez Marina [ 14/Aug/17 ]

lucasoares, I've created an upload portal so you can send logs and diagnostic data; please let us know when the uploads are complete.

Thanks,
Ramón.

Comment by Lucas [ 12/Aug/17 ]

Unfortunately, there is a limit of 150MB for attachments. We have tons of logs and diagnostic.data files from the past 3 days to upload.

Can you guys open a private portal to upload these files?

Thanks.
