[SERVER-26722] router blocks and throws ExceededTimeLimit Created: 21/Oct/16 Updated: 08/Jan/24 Resolved: 08/Dec/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying, Sharding |
| Affects Version/s: | 3.2.10 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kay Agahd | Assignee: | Mira Carey |
| Resolution: | Duplicate | Votes: | 2 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | | ||
| Issue Links: | | ||
| Backwards Compatibility: | Fully Compatible | ||
| Operating System: | ALL | ||
| Sprint: | Sharding 2016-11-21, Sharding 2016-12-12 | ||
| Participants: | | ||
| Case: | (copied to CRM) | ||
| Description |
|
We have upgraded one of our sharded clusters from v3.0.12 to v3.2.10. Since then the cluster has not been operational because many operations get blocked by the router. The corresponding log message looks like this:
We can reproduce the issue at any time just by executing a findOne through the router several times (a sketch of such a loop is shown below):
It blocks after only a few findOne calls. We found out that the mongodb router (mongos) v3.0.12 doesn't have this problem, which is why we downgraded all our routers to v3.0.12 even though the rest of the cluster (the mongod processes) is running v3.2.10. Please see the attached log file from the router. |
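A minimal sketch of the repeated findOne loop described above, written for the mongo shell; the database, collection, and field names are placeholders, not the reporter's actual namespaces:

```javascript
// Hypothetical repro sketch: issue a handful of findOne calls through the mongos.
// "test", "items" and "indexedField" are placeholder names.
for (var i = 0; i < 20; i++) {
    printjson(db.getSiblingDB("test").items.findOne({ indexedField: 42 }));
}
```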
| Comments |
| Comment by Kay Agahd [ 04/Apr/17 ] | ||||||||||
|
This issue is not resolved. We can reproduce it in v3.2.12.
We observe this behaviour regardless of whether the query uses the shard key or not. In all cases the queried field is indexed. A downgrade of the routers to v3.0.12 is not possible because our config servers are running as a replica set instead of a mirrored set. Please advise. PS: Since this ticket has already been closed, we've created a new ticket, please see: | ||||||||||
| Comment by Mira Carey [ 08/Dec/16 ] | ||||||||||
|
This is fixed in | ||||||||||
| Comment by Mira Carey [ 01/Dec/16 ] | ||||||||||
|
That's correct, ShardingTaskExecutorPoolRefreshTimeoutMS is the setParameter I'm referring to. The default value is 20000 (i.e. 20 seconds * 1000 ms per second). You'll want to increase that to a value long enough that only a very small number of connections time out. That particular value will depend on your network and its responsiveness under periods of load. I'd suggest trying 60000 (1 minute) and ramping up from there until the behavior begins to clear up. Note that if you have user operations with their own timeouts, those may still trigger, i.e. if you set a maxTimeMS of 30 seconds on a query and it occasionally takes 1 minute to connect, those operations will still fail. | ||||||||||
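To illustrate the interaction described above, a hedged sketch for the mongo shell; the namespace, field, and the 30-second maxTimeMS are placeholders:

```javascript
// Even with ShardingTaskExecutorPoolRefreshTimeoutMS raised (e.g. to 60000 at mongos
// startup), a query's own maxTimeMS still applies: if establishing a connection
// occasionally takes longer than the query's maxTimeMS, the query can still fail
// with ExceededTimeLimit.
db.getSiblingDB("test").items.find({ indexedField: 42 }).maxTimeMS(30000).toArray();
```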
| Comment by Jon Hyman [ 01/Dec/16 ] | ||||||||||
|
Great, thanks. | ||||||||||
| Comment by Mira Carey [ 01/Dec/16 ] | ||||||||||
|
I'm definitely considering this for 3.2 and 3.4, but haven't had a chance to run down which specific minor versions we'll target with our release team. I'll apply specific fix versions as those targets get negotiated. | ||||||||||
| Comment by Antonis Giannopoulos [ 01/Dec/16 ] | ||||||||||
|
I am willing to test with an increased "ShardingTaskExecutorPoolRefreshTimeoutMS" on an environment where I have successfully reproduced the error multiple times. I believe ShardingTaskExecutorPoolRefreshTimeoutMS is exposed in 3.2.11 in the mongos configuration file (please confirm; a generic sketch of the syntax is shown below):
Can you recommend a starting value for ShardingTaskExecutorPoolRefreshTimeoutMS? Thanks | ||||||||||
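For reference, a generic sketch of how a server parameter can be set in a mongos YAML configuration file; whether 3.2.11 honors this particular parameter this way is exactly the question above:

```yaml
# Assumption: the parameter is accepted via the generic setParameter section.
setParameter:
  ShardingTaskExecutorPoolRefreshTimeoutMS: 60000
```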
| Comment by Jon Hyman [ 30/Nov/16 ] | ||||||||||
|
I see that you set this as 3.5 Required. This issue occurred in 3.2.10; is there no way you can backport this to 3.2.x? We're currently stuck on 3.2.8, which has mongos segfaults, because we were forced to downgrade after this issue immediately wrecked our application when we upgraded to 3.2.10. Are we going to have to wait until 3.6 comes out before this is fixed? I may just not be understanding your Fix Version tag. | ||||||||||
| Comment by Mira Carey [ 30/Nov/16 ] | ||||||||||
|
After some investigation, I've opened a follow-up ticket. The problem is that we have timeouts on connects (which aren't directly linked to operations) which, when they trigger, propagate as timeouts to user operations. And it triggers for all connections currently in the pool. So if you're ever in a situation where you see connection timeouts (default 20 seconds), you'll see waves of operation timeouts as a result. I believe increasing ShardingTaskExecutorPoolRefreshTimeoutMS to a higher value may be a viable workaround. | ||||||||||
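A hedged sketch of applying that workaround at mongos startup; the config server replica set name and hosts are placeholders, and 60000 ms is only a starting point to tune from:

```
mongos --configdb <configReplSetName>/<host1:port1>,<host2:port2>,<host3:port3> \
       --setParameter ShardingTaskExecutorPoolRefreshTimeoutMS=60000
```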
| Comment by Kelsey Schubert [ 28/Nov/16 ] | ||||||||||
|
Hi all, We're continuing to investigate this behavior and are making progress towards reproducing this issue. Please note that this issue is distinct from the related ticket. Please continue to watch this ticket for updates. Kind regards, | ||||||||||
| Comment by Kay Agahd [ 12/Nov/16 ] | ||||||||||
|
[09:02:26]root@offerstore-en-router-03.ipx /home/admin# uname -a | ||||||||||
| Comment by Kaloian Manassiev [ 11/Nov/16 ] | ||||||||||
|
Thanks agherardi. In your case you are definitely not hitting the glibc deadlock problem. Per our previous conversation, please if possible attach the mongos logs with increased verbosity. Thanks again for your help. -Kal. | ||||||||||
| Comment by Alessandro Gherardi [ 11/Nov/16 ] | ||||||||||
|
We use Windows. | ||||||||||
| Comment by Antonis Giannopoulos [ 11/Nov/16 ] | ||||||||||
|
Linux HOSTNAME 2.6.32-042stab110.1 #1 SMP Fri Jul 3 16:58:07 MSK 2015 x86_64 x86_64 x86_64 GNU/Linux | ||||||||||
| Comment by Roy Reznik [ 11/Nov/16 ] | ||||||||||
|
Linux HOSTNAME 4.4.0-45-generic #66~14.04.1-Ubuntu SMP Wed Oct 19 15:05:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | ||||||||||
| Comment by Kaloian Manassiev [ 11/Nov/16 ] | ||||||||||
|
kay.agahd@idealo.de, agherardi, royrez@microsoft.com, antogiann - We are working on a hypothesis that this could be a bug in glibc. See this comment in the related ticket for more detail. Would it be possible to run "uname -a" on one of the machines where this problem occurs and include the output so we have a few samples to look at? Thank you in advance. -Kal. | ||||||||||
| Comment by Roy Reznik [ 10/Nov/16 ] | ||||||||||
|
We have the same issue with 3.2.10. | ||||||||||
| Comment by Alessandro Gherardi [ 09/Nov/16 ] | ||||||||||
|
We started seeing the same issue when we upgraded from 3.0.4 to 3.2.10. We just tried 3.2.11-rc0 - which supposedly has the patch for https://jira.mongodb.org/browse/SERVER-26859 - but no luck. We use CSRS. | ||||||||||
| Comment by Kay Agahd [ 01/Nov/16 ] | ||||||||||
|
Good to hear, thank you! | ||||||||||
| Comment by Kaloian Manassiev [ 01/Nov/16 ] | ||||||||||
|
Thanks kay.agahd@idealo.de for the information - the original logs and your note about the increased number of connections in the TCP TIME_WAIT state were really helpful. We now have a sense of what is going on and the problem is described in | ||||||||||
| Comment by Kay Agahd [ 26/Oct/16 ] | ||||||||||
|
kaloian.manassiev, I tried to reproduce the issue for over 30 minutes. Up to 10 clients executed tens of thousands of finds in parallel in a loop, but even though some finds took up to 10 or even 20 seconds, it never got blocked like on October 20. | ||||||||||
| Comment by Kay Agahd [ 26/Oct/16 ] | ||||||||||
|
Thanks kaloian.manassiev, I'll try to install a v3.2.10 router, reproduce the issue and upload the extended router logs today. | ||||||||||
| Comment by Kaloian Manassiev [ 25/Oct/16 ] | ||||||||||
|
Apologies for the delayed reply and thank you for providing us with the mongos logs. One hypothesis that we are working on is that we have code which performs blocking I/O work in an asynchronous network worker thread. The code in question would be exercised only if a network connection establishment fails for some reason, but once that happens it could block subsequent connection establishments or network calls which happen to use the same thread. The fact that it happens during a find operation makes this hypothesis even more likely. Unfortunately, we only log this information at log level 1, so it is not present in these logs. Would it be possible for you to perform the mongos upgrade again and run the repro with log level 1 enabled on mongos so we can see whether retries are actually happening? Thanks in advance. Best regards, |
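A minimal sketch of enabling log level 1 on a running mongos from the mongo shell, as requested above (verbosity can also be raised at startup with the -v flag):

```javascript
// Raise the default log verbosity of this mongos to level 1.
db.adminCommand({ setParameter: 1, logLevel: 1 });
// Equivalent shell helper:
db.setLogLevel(1);
```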