[SERVER-28629] router blocks and throws ExceededTimeLimit
Created: 04/Apr/17 | Updated: 08/Jan/24 | Resolved: 22/Jun/17
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Networking, Sharding |
| Affects Version/s: | 3.2.12 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kay Agahd | Assignee: | Mira Carey |
| Resolution: | Duplicate | Votes: | 4 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | We can reproduce the issue at any time simply by executing a findOne through the router several times; a minimal sketch is shown below.
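The sketch below is illustrative rather than the original commands: the mongos address and database name are hypothetical, while the offer collection and its _id shard key are taken from the description.

```
# Illustrative reproduction, not the original commands: hypothetical mongos address
# and database name; "offer" and the _id shard key are from the report.
for i in $(seq 1 20); do
  mongo router-11:27017/offerStore --quiet --eval '
    var start = Date.now();
    db.offer.findOne({ _id: ObjectId("58e1a2b3c4d5e6f708090a0b") });  // any existing _id
    print("findOne took " + (Date.now() - start) + " ms");'
done
```

On a healthy router each call returns in a few milliseconds; on an affected one, calls stall or fail with ExceededTimeLimit after only a handful of iterations.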
It blocks after only a few findOnes. |
| Participants: | |
| Description |
|
We are using a new sharded cluster running v3.2.12. Our cluster is not operational because many operations get blocked by the router. The corresponding log message looks like this:
We observe this behaviour independent of whether the query uses the shard key or not. In all cases the queried field is indexed. A downgrade of the routers to v3.0.12 is not possible because our config servers are running as a replica set instead of a mirrored set. Please see also the two monitoring screenshots of the router's TCP sockets. As you can see, tcp_tw (TCP TIME_WAIT) is very high. This ticket is related to |
| Comments |
| Comment by Mira Carey [ 22/Jun/17 ] |
|
Now that we've picked a release for the backport of the linked fix, I'm resolving this ticket as a duplicate. While you're welcome to stick with other workarounds if they're stable for you, maxConnecting may also be a viable option. That will depend on whether the initial latency of spinning up new connections is tolerable once we drive down connection churn. If you have any questions, feel free to re-open or open a new ticket as needed. |
| Comment by Hyun Gul Roh [X] [ 17/May/17 ] |
|
Thanks, @kay.agahd. If the connections from the clients are stable, are all the TIME_WAIT sockets caused by the mongos closing its connections to the mongod data nodes? I'd like to ask the MongoDB engineers this question. |
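One way to check where the TIME_WAIT sockets on a mongos host actually point is to group them by peer address; an illustrative sketch (assumes netstat is available, and which ports the mongods listen on will vary):

```
# Count TIME_WAIT sockets on this host, grouped by peer address.
# Peers on the mongod port(s) would indicate churn toward the data nodes.
netstat -ant | awk '$6 == "TIME_WAIT" {print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head
```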
| Comment by Kay Agahd [ 16/May/17 ] |
Yes. |
| Comment by Hyun Gul Roh [X] [ 16/May/17 ] |
|
Thanks, kay.agahd. If no new connections are being made from the clients, our case seems different from this issue. Anyway, thank you for your recommendation. |
| Comment by Kay Agahd [ 16/May/17 ] |
|
Hi Hyun Gul Roh,
The number of connections remained stable, so I don't think the connections were closed.
We are using mongos v3.2.12 with added pool-tuning parameters (see the configurations discussed further down in this thread).
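For anyone wanting to confirm which values a particular mongos is actually running with, a check along these lines works (an illustrative sketch; it assumes the mongos listens on localhost:27017):

```
# Illustrative: read the effective pool-tuning parameters back from a running mongos.
mongo localhost:27017/admin --quiet --eval '
  printjson(db.adminCommand({
    getParameter: 1,
    ShardingTaskExecutorPoolHostTimeoutMS: 1,
    ShardingTaskExecutorPoolRefreshRequirementMS: 1,
    ShardingTaskExecutorPoolMinSize: 1,
    ShardingTaskExecutorPoolMaxSize: 1
  }))'
```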
Since then, no crashes and no blocking. All seems to be fine so far. |
| Comment by Hyun Gul Roh [X] [ 16/May/17 ] |
|
Hi all. We've got very similar issues on our MongoDB cluster.
mongos02 690
|
| Comment by Jon Hyman [ 11/May/17 ] |
|
To follow up on Wayne's post: when this is happening, the mongos largely stops responding to queries and we get operation failures with "Couldn't get a connection within the time limit (50)", which effectively takes down our application. What other options do we have for mitigation? Here is our glibc setting, so we don't think that's the cause.
|
| Comment by Wayne Egerer [ 11/May/17 ] |
|
Hey @Jason Carey, we are running mongos v3.2.12. The cluster is busy, so we suspect that is why we still see the issue with the recommended config changes while others may not. A typical log line:
2017-05-09T18:13:21.762+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-2-0] Failed to connect to c0-r9-data-001.mongo:27017 - ExceededTimeLimit: Operation timed out |
| Comment by Kay Agahd [ 08/May/17 ] |
|
Hi mira.carey@mongodb.com, we followed your suggestion and tried two different configurations, one after the other, while running mongos v3.2.12.
After 12 minutes of runtime with the first configuration, queries got blocked and TCP TIME_WAIT increased, so we had to stop this mongos and try the next configuration. Config 2: we multiplied the default value of ShardingTaskExecutorPoolHostTimeoutMS by 12 and ShardingTaskExecutorPoolMinSize by 100 (the resulting values are sketched below).
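Assuming the stock defaults for these two parameters (300000 ms for the host timeout and a minimum pool size of 1), that multiplication works out to roughly the following startup flags; this is a sketch only, and the config server string is hypothetical:

```
# Sketch: defaults of 300000 ms and 1 assumed, multiplied by 12 and 100 respectively.
mongos --configdb cfgRS/cfg1:27019,cfg2:27019,cfg3:27019 \
       --setParameter ShardingTaskExecutorPoolHostTimeoutMS=3600000 \
       --setParameter ShardingTaskExecutorPoolMinSize=100
```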
With config 2, the mongos ran for more than an hour without blocking and no TCP TIME_WAIT spikes were observed, so the second configuration could serve as a workaround. We could also try setting the values of the first configuration to extremely high values, say one year or so, so that the spikes would only occur after a year if we didn't restart the router in the meantime (which normally happens earlier). |
| Comment by Antonis Giannopoulos [ 08/May/17 ] |
|
Hi, we are having the same issue (a more detailed description on
The ShardingTaskExecutorPoolMinSize value came from the fact that the shards were maintaining around 2000 connections prior to the change. We are running 4 mongos, and each mongos has 6 CPUs (so 6 connection pools), so I expect at least 4*6*100 = 2400 connections, which is close to what the cluster needs. There is no particular reason why we picked ShardingTaskExecutorPoolHostTimeoutMS = 300000000; it was the only value that didn't produce the exception during the benchmarks in our test environment, but as testing continues we may be able to determine a lower value for it. So far I haven't noticed any "Failed to connect to - ExceededTimeLimit: Operation timed out" in the logs, but it's only been 4 hours since we made the change. I will have some feedback after 24-48 hours, once the ShardingTaskExecutorPoolHostTimeoutMS window has been exhausted. Ant |
| Comment by Kay Agahd [ 05/May/17 ] |
|
Hi mira.carey@mongodb.com, thank you for the helpful information. We will test your suggested configuration next week and report back. For your information: we can reproduce the issue with mongos v3.2.9 as well, so the bug must have been introduced in v3.2.9. |
| Comment by Mira Carey [ 04/May/17 ] |
|
If you're having success with 3.2.8, it's likely that you're benefiting from the connection pool never shrinking and never heartbeating (which would force a task to wait while the heartbeat occurs), and therefore never experiencing the cost of creating a new connection to a loaded mongod. You can replicate that behavior with a set of setParameters that were made available starting in 3.2.11 (via ShardingTaskExecutorPoolHostTimeoutMS, ShardingTaskExecutorPoolRefreshRequirementMS, and ShardingTaskExecutorPoolMinSize).
Setting the host timeout and refresh requirement to very high values will effectively reproduce the behavior of the connection pooling in 3.2.8. Another option would be to increase the host timeout and min pool size; this would limit connection churn from spiky traffic by keeping connections around through lulls. A sketch follows below. |
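A sketch of what such a startup might look like; the values are placeholders rather than recommendations, and the config server string is hypothetical:

```
# Illustrative values only: a very long host timeout and refresh requirement
# approximate the 3.2.8 behavior (pool never shrinks, no forced refresh waits).
mongos --configdb cfgRS/cfg1:27019,cfg2:27019,cfg3:27019 \
       --setParameter ShardingTaskExecutorPoolHostTimeoutMS=86400000 \
       --setParameter ShardingTaskExecutorPoolRefreshRequirementMS=86400000 \
       --setParameter ShardingTaskExecutorPoolMinSize=50
```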
| Comment by Kay Agahd [ 19/Apr/17 ] |
|
Hi acm, I just forgot to confirm that you're right with your definition of the "bad" and the "good" cluster, made on Apr 13 2017 08:48:23. |
| Comment by Kay Agahd [ 19/Apr/17 ] |
|
Hi acm, I'm on sick leave until the end of this week. My colleague may follow up if his time allows. |
| Comment by Andrew Morrow (Inactive) [ 18/Apr/17 ] |
|
Hi kay.agahd@idealo.de - I'm following up on this ticket because I haven't heard anything back from you in a few days. Do you have any further information regarding the questions I asked last week? Will you be able to perform any of the experiments I suggested (I understand that you might not be able to)? I'll be working again today to try to reproduce your issue, so any additional information you can provide would be valuable. |
| Comment by Andrew Morrow (Inactive) [ 13/Apr/17 ] |
|
kay.agahd@idealo.de - So far, I have not been able to reproduce the behavior you have observed, though I will keep trying. In an effort to continue to gather data, would you be in a position to run either or both of the following experiments:
The first test will let us see if this issue reproduces in both SCCC and CSRS environments, and the second will bisect some of the range over which this apparent regression exists. |
| Comment by Andrew Morrow (Inactive) [ 13/Apr/17 ] |
|
kay.agahd@idealo.de - I overlooked that you had downgraded mongos from 3.2.12 to 3.2.8 in the cluster I named as "Bad" above. Presumably, the "Bad" cluster is actually working fine for you at that version, and you only reproduce the ExceededTimeLimit errors if you advance mongos to 3.2.10 or 3.2.12. |
| Comment by Andrew Morrow (Inactive) [ 13/Apr/17 ] |
|
kay.agahd@idealo.de - Thank you for providing that. I'd like to clarify/confirm one thing. Am I correct that in both the "good" cluster and the "bad" cluster, all of the mongos instances are running 3.2.8? The differences between the clusters (ignoring, for the moment, sizes, hardware variations, VMs, etc.), based on your notes above, appear to be: The "Good" cluster:
The "Bad" cluster:
In other words, the version of mongos does not vary between the good cluster and the bad cluster, but the version of mongod does. Is that correct, or have I misinterpreted your above writeup somehow? |
| Comment by Kay Agahd [ 12/Apr/17 ] |
|
Hi acm, thank you for your quick reply. I'm happy to help get this bug fixed.
All our MongoDB clusters are running with authentication (via keyfile) and the WiredTiger storage engine. Our cluster is in MMS. Our group name is idealo and the cluster name is offerStoreFR, but MMS wrongly shows the new and the old cluster as one and the same cluster consisting of 7 shards. The old cluster (running mixed mongods v3.2.10 and v3.2.11 and mongos v3.2.8, currently in production) consists of the following shards:
The new cluster (running mongods v3.2.12 and, at the time, mongos v3.2.12, now downgraded to v3.2.8; not yet in production because it is still syncing from the old cluster) consists of the following shards:
Also, MMS wrongly shows only the 3 routers and config servers of the new cluster. I've just discovered that MMS shows the routers and config servers of the old cluster under the cluster name "Cluster_4". Example offer document, sensitive content replaced by X:
Index stats in bytes of the offer collection for only 1 of the 4 shards, sensitive content replaced by XXX:
Stats in bytes of the offer collection for all 4 shards:
|
| Comment by Andrew Morrow (Inactive) [ 12/Apr/17 ] |
|
kay.agahd@idealo.de - Another potentially interesting bit of information would be whether your reproducer above exhibits the high latencies with 3.2.9, as that would help narrow the search space from 3.2.8..3.2.10 to one of 3.2.8..3.2.9 or 3.2.9..3.2.10, even before I am able to get a local reproduction. |
| Comment by Andrew Morrow (Inactive) [ 12/Apr/17 ] |
|
kay.agahd@idealo.de - Thank you for the detailed writeup. I am going to attempt to reproduce what you are seeing on my side. If I am able to do so, it should be fairly easy to do a bisect between 3.2.8 and 3.2.12 to understand what change has caused this performance regression. In an effort to produce as faithful a reproduction environment as possible, could you please provide some additional information? I do apologize if this has been asked and answered elsewhere already, but as there is a somewhat muddled ticket history here, I think it would be good to ensure that this information is all captured in one place:
|
| Comment by Kay Agahd [ 12/Apr/17 ] |
|
Hi acm, to answer your question: yes, the cluster behavior with respect to ExceededTimeLimit errors in 3.2.12 is identical to what we observed in 3.2.10. This is true from a server-side point of view because not only are the error messages identical, the abnormally high TCP TIME_WAIT spikes are identical as well. It is also true from a client-side point of view because the execution times of find queries are abnormally high, or the queries even time out. We found that mongos v3.2.8 does not have this issue. We know that v3.2.8 may crash due to
Here is v3.2.8:
As you can see, the latencies of mongos v3.2.8 are never above 25 ms, whereas mongos v3.2.12 may reach latencies of over 200 ms. Please see also the screenshot tcp_timewait_3.2.12vs3.2.8.jpg, which clearly shows that the TCP TIME_WAIT spikes were much higher (nearly 500) for v3.2.12 compared to those of v3.2.8 (max 72). You can see the timestamps on the x-axis and correlate them with the timestamps of my tests, when the mongos reconnected due to the mongos restart. Right now, we are using mongos v3.2.8 in production. The difference from mongos v3.2.12 is more than obvious: please compare the two TIME_WAIT screenshots, v3.2.8_tcp_tw.jpg versus v3.2.12_tcp_tw.jpg. These are screenshots of our production system. The maximum for v3.2.8 is 4, whereas the maximum for v3.2.12 is 6000! |
| Comment by Kay Agahd [ 11/Apr/17 ] |
|
Hi anonymous.user, here are the values you've asked for:
Thank you. |
| Comment by Kelsey Schubert [ 11/Apr/17 ] |
|
I understand you are using debian71. Would you please provide the following output so we can rule out some theories that only affect specific versions of glibc?
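For reference, a couple of common ways to report the glibc version on a Debian host, shown here purely as an illustration (not necessarily the exact output that was requested):

```
# Illustrative commands to report the glibc version on a Debian system.
ldd --version | head -n 1         # ldd ships with glibc and prints its version
dpkg -s libc6 | grep '^Version'   # Debian package version of glibc
```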
Thank you, |
| Comment by Andrew Morrow (Inactive) [ 11/Apr/17 ] |
|
Hi kay.agahd@idealo.de - I'll be filling in for Jason for a little while as he is currently unavailable. I'd like to take a step back from the prior discussion and ask a higher level question.
Previously, you had filed
Concurrently with the above,
The fix for
I ask all this because if the type of issue fixed in
I have asked the engineers who worked on |
| Comment by Kay Agahd [ 09/Apr/17 ] |
|
Thanks antogiann for the suggestion. We tried it out but it did not solve the problem; we still had ExceededTimeLimit errors. mira.carey@mongodb.com, you mean that you're interested in seeing what our mongos traffic on v3.2 looks like. Well, since the v3.2 mongos routers are simply not operational, we can't use them in production, so we can't give you more production logs. However, we could give you the logs from our mixed cluster (mongods = v3.2.10 and v3.2.11, config servers = SCCC, and mongos = v3.0.12). |
| Comment by Antonis Giannopoulos [ 09/Apr/17 ] |
|
On the instance we managed to reproduce
Unfortunately, in production the same settings didn't have the desired result. I assume it is not related to the load on the shards, as my benchmark stresses the instance with 10x more traffic than production. |
| Comment by Mira Carey [ 05/Apr/17 ] |
|
Responding inline below:
Thanks for the context, that makes sense.
I'd be interested in seeing what your mongos logs look like when that's the only traffic in your 3.2 cluster. What I see in your attached logs is a lot of parallel clients, which is a different (and also important to us) issue.
I don't disagree with you; you're definitely running into a different operational profile between 3.0 and 3.2. My experience with the 3.2 mongos has been that it's much more sensitive to spikes in demand than 3.0. Where 3.0 used a few more threads and had a few more locks, 3.2 tends to jump on a request and immediately spin up a lot of connections. Most of the pain I've seen in 3.2 deployments has been in systems that are stable when load is regular and even, but that overwhelm particular shards when load spikes. In looking at your logs, I'm trying to figure out whether the ExceededTimeLimit errors you're seeing are real and correct (in which case the problem lies in how mongos spools up), or whether they're false (firing too early, logging incorrectly, etc.). Without expanded logging (a sketch of how to enable it follows below), I have to make that inference based on client load and whether successful connects are mixed in. In this case, you do have load, you do have successful connects mixed in, and there's nothing overly suspicious in there. That inclines me to suggest some of the operational workarounds that have worked for other users. For the other vein, we've backported |
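As an aside on the expanded-logging point above: network-related log verbosity on a mongos can be raised at runtime through the logComponentVerbosity parameter. An illustrative sketch (the address and the verbosity level of 2 are assumptions):

```
# Illustrative: raise network and executor log verbosity on a running mongos.
mongo localhost:27017/admin --quiet --eval '
  printjson(db.adminCommand({
    setParameter: 1,
    logComponentVerbosity: { network: { verbosity: 2 }, executor: { verbosity: 2 } }
  }))'
```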
| Comment by Kay Agahd [ 05/Apr/17 ] |
|
Hi mira.carey@mongodb.com, thank you for your detailed analysis! Let me answer your questions and explain our workload. Your observation of client spikes from 15:17 to 15:36 is expected, because at this time we moved the client application from the old MongoDB cluster to the new one. In general, our mongos routers have around 2k-3k connections each. Some mongos routers running MongoDB v3.0.12 already had spikes of up to 6k connections without any ExceededTimeLimit exception. Please see our TIME_WAIT screenshot "tcp-tw_v3.0.12.jpg" of a mongos router running v3.0.12, which handles even more connections than the blocking router running v3.2.12, for which we posted the screenshot "fr-11_tcpwaitOnly.jpg" yesterday. Right now, since the new v3.2.12 cluster is not operational, we have moved our client application back to the old cluster, which does not exhibit any of the blocking behaviour seen on the new v3.2.12 cluster. As you can see in "Steps to Reproduce" above, the query that may block by throwing ExceededTimeLimit errors in v3.2.12 queries the _id field, which is also the shard key. We can also reproduce the issue using another indexed, unique field called "offerId". We strongly believe this issue is tied to MongoDB routers running v3.2.x, because we run 6 different MongoDB clusters on over 100 bare-metal machines, all with the same workload, and only those using v3.2 mongos routers encounter ExceededTimeLimit errors. This makes it impossible for us to use MongoDB v3.2 in production. |
| Comment by Mira Carey [ 05/Apr/17 ] |
|
I'm taking another look at the logic around how we handle ExceededTimeLimit calls, but in the meanwhile I have some observations, and some setParameters I'd like you to try if you can.
Observations:
Load seems high
You're seeing a lot of new clients in a small amount of time. At the point at which ExceededTimeLimit starts to dominate the logs (15:36), we're at 994 connections open to the mongos. As recently as 15:17, 20 minutes prior, the number of clients was stable at less than 100. There's also a lot of churn going on, with 1271 new clients in the minute of 15:33 (note that this is prior to ExceededTimeLimit showing up substantially in the logs) and 923 retiring ones. This speaks to a workload more complex than one connection making a sequence of queries.
That load is generating a lot of outbound connections
All of those clients are triggering a lot of new connections to various mongods. Going back to 15:33 (before we began to see connection timeouts), we see as many as 210 new connections to the 'mongo-066' mongod in a single minute. That isn't long enough for any of those connections to exit the pool via regular expiration. It also lines up with the 350 net new clients we gained in that minute (assuming poor connection re-use, perhaps because the burst in traffic was overloading the various mongods and preventing them from returning quickly).
All of the ExceededTimeLimit calls you see happen on one particular TaskExecutor
Despite work being well distributed across the various executors early in the logs, by 15:36 the bulk of the connecting, and failing to connect, occurs in 'NetworkInterfaceASIO-TaskExecutorPool-0-0'. The fact that the other pools don't see similar log messages probably points to something of a death spiral. In this case, I don't believe that it's the same behavior as in
The way connection timeouts work goes something like this:
That can clearly exacerbate an already bad problem, as one pool lacks enough connections to satisfy its requests, and keeps loading the server with new connections to accept from old and new clients. That problem is acute, and comes in waves, because a set of connections that time out around the same time will generate new connection attempts at the same time.
Some questions
If the spike was unexpected, you may be seeing a situation where your connection pools are too large in your client application. Things normally hum along fine, but as soon as a network blip comes in, or a mongod becomes unresponsive for a moment (maybe a step down), many new connections are suddenly established that previously weren't active (as most requests could be served by the same smaller pool of connections in serial).
Some suggestions
Limit the number of mongos -> mongod connections
If the problem is that you're introducing many new clients, which are creating many connections to various mongods, which in turn load those machines, then you might benefit from setting ShardingTaskExecutorPoolMaxSize. It constrains the maximum number of connections in any given pool in mongos to any given host. This is viable only if you know that most operations can be expected to be short lived (as setting this to a low number could allow long running queries to eventually crowd out and starve short running operations). It interacts with TaskExecutorPoolSize (the number of connection pools to use for a given mongos), which defaults to 4 <= NUM_CORES <= 64. It appears that your mongos' run on 4-core boxes, so you probably don't need to tune it, but it affects the math for choosing the pool max size. Using something like lsof to determine the number of stable connections from a particular mongos to any given mongod, you could divide that number by 4 (to find the cap per pool), then add in a bit of a fudge factor (perhaps 2-3x) and see if limiting connection churn fixes your problem (see the sketch below). That should insulate you from having networking/perf glitches turn into connection storms. Effectively, this solution involves using mongos to perform soft admission control / queueing rather than attempting to satisfy many requests in parallel.
Make the connection timeout longer
If your system is capable of re-stabilizing from a high number of active clients on mongos (and many active connections on the mongod), you may be able to simply raise the timeout on connect. By changing ShardingTaskExecutorPoolRefreshTimeoutMS from its default of 20000 to something more generous, like 60000, you may be able to weather these bursts of traffic. This works if your problem is bursts of traffic followed by relatively idle periods, and if the queries you're running have timeouts >20 seconds today. |
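To make the sizing suggestion above concrete, here is a rough sketch of counting stable connections from a mongos host to one mongod and deriving a per-pool cap. The host name, port, and resulting values are hypothetical, and the refresh timeout parameter is assumed to be ShardingTaskExecutorPoolRefreshTimeoutMS:

```
# Count established connections from this mongos host to a given mongod
# (hypothetical host name and port); the first output line of lsof is a header.
sudo lsof -nP -iTCP@mongo-066:27017 -sTCP:ESTABLISHED | tail -n +2 | wc -l

# Example: ~200 stable connections / 4 task executor pools = 50 per pool;
# with a 2-3x fudge factor, a cap around 100-150 could be a starting point.
mongos --configdb cfgRS/cfg1:27019,cfg2:27019,cfg3:27019 \
       --setParameter ShardingTaskExecutorPoolMaxSize=150 \
       --setParameter ShardingTaskExecutorPoolRefreshTimeoutMS=60000
```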
| Comment by Kay Agahd [ 05/Apr/17 ] |
|
jonhyman, just for info, in contrast to |
| Comment by Kay Agahd [ 05/Apr/17 ] |
|
Thanks ramon.fernandez for your fast reply. I've uploaded the first 70000 lines of the router log file as router-11.log.1.tgz to your upload portal. They contain 736 ExceededTimeLimit error lines. Let me know if you need further logs. |
| Comment by Jon Hyman [ 05/Apr/17 ] |
|
This is likely related to our ticket at |
| Comment by Ramon Fernandez Marina [ 05/Apr/17 ] |
|
Sorry your cluster is having issues, kay.agahd@idealo.de. I've created a secure upload portal for you to upload the logs. I'm thinking that a section of the logs from a mongos startup until the first few ExceededTimeLimit messages appear should be a good start, but feel free to upload more if you want. Thanks, |
| Comment by Kay Agahd [ 04/Apr/17 ] |
|
The log file of the router is already over 600 MB. If you need it, where can we upload it? |