[SERVER-26654] ExceededTimeLimit: Operation timed out after upgrade from 3.2.8 to 3.2.10 Created: 17/Oct/16 Updated: 11/Apr/17 Resolved: 11/Nov/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Networking, Sharding |
| Affects Version/s: | 3.2.10 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Antonis Giannopoulos | Assignee: | Kaloian Manassiev |
| Resolution: | Duplicate | Votes: | 2 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | | ||
| Issue Links: | | ||
| Operating System: | ALL | ||
| Sprint: | Sharding 2016-11-21 | ||
| Participants: | | ||
| Case: | (copied to CRM) | ||
| Description |
|
We are running our production cluster on version 3.2.8. The cluster consists of four mongos, three config servers, and ten shards, each a 3-node replica set; we perform both primary and secondary reads. After upgrading to 3.2.10, we very frequently hit "ExceededTimeLimit: Operation timed out" errors.
Is this a known bug in 3.2.10 and is there any workaround? |
| Comments |
| Comment by Alessandro Gherardi [ 12/Nov/16 ] | ||||||
|
Here are the log files, broken down per server. The environment consists of 3 Windows 2012 servers. There is a single replica set with 3 mongod's (one on each server) and a CSRS with 3 config servers (one on each server). There is a mongos on each server. The IPs are 192.168.20.201 (mongod0-rs1-GUID and configsvr1-GUID), 192.168.20.202 (mongod2-rs1-GUID and configsvr2-GUID), and 192.168.20.203 (mongod1-rs1-GUID and configsvr3-GUID). Note that I rebooted the servers prior to running the test (this is a test environment), so connection failures earlier in the logs are expected. | ||||||
| Comment by Kaloian Manassiev [ 11/Nov/16 ] | ||||||
|
antogiann, I am closing this ticket and moving the investigation to happen in | ||||||
| Comment by Kaloian Manassiev [ 10/Nov/16 ] | ||||||
|
Yes, I meant the mongod node to which mongos failed to connect. For example, this would be node node0-49630:49630 if this was the error message:
-Kal. | ||||||
| Comment by Alessandro Gherardi [ 10/Nov/16 ] | ||||||
|
Kal,
>>>> the log from the host against which the call was made?
>>>> would it be possible that you try to use 3.2.11-rc0 again
Thanks, | ||||||
| Comment by Kaloian Manassiev [ 10/Nov/16 ] | ||||||
|
Thank you antogiann and agherardi for trying out 3.2.11-rc0. It is unfortunate that our hypothesis was not correct, and we have also been unable to reproduce the issue while running sysbench in AWS as antogiann suggested. However, in 3.2.11-rc0 we improved the logging around network timeouts, which should be useful in tracking this down. agherardi, would it be possible for you to try 3.2.11-rc0 again and increase the log verbosity for the command, network and asio components using the calls below until the issue reproduces, and then include the mongos log and the log from the host against which the call was made? This is the command to increase log verbosity:
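A minimal sketch of such a verbosity bump, run from a mongo shell connected to the affected mongos (the level of 2 and the use of the db.setLogLevel helper are assumptions; any mechanism that raises the command, network and asio component verbosity would do):

    // Raise verbosity for the components relevant to connection timeouts.
    db.setLogLevel(2, "command");       // command dispatch/execution logging
    db.setLogLevel(2, "network");       // connection pool and wire-level logging
    db.setLogLevel(2, "network.asio");  // ASIO-based networking internals (subcomponent of network)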
And this is the command to bring it back to zero:
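A corresponding sketch for the reset, again assuming the db.setLogLevel helper (level 0 is the default verbosity):

    // Return the same components to the default verbosity level.
    db.setLogLevel(0, "command");
    db.setLogLevel(0, "network");
    db.setLogLevel(0, "network.asio");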
Thank you very much in advance for your help. -Kal. | ||||||
| Comment by Antonis Giannopoulos [ 10/Nov/16 ] | ||||||
|
Hi Kal, I tested using the same methodology against 3.2.11-rc0 a few minutes ago and the issue is still there. Thanks, | ||||||
| Comment by Roy Reznik [ 10/Nov/16 ] | ||||||
|
We have the same issue with 3.2.10. | ||||||
| Comment by Alessandro Gherardi [ 09/Nov/16 ] | ||||||
|
We have the same problem. We just tried 3.2.11-rc0, same issue. | ||||||
| Comment by Antonis Giannopoulos [ 09/Nov/16 ] | ||||||
|
Thanks Kal, I will try the mongo-sysbench configuration that reproduced the issue in 3.2.10 against 3.2.11-rc0 (when it's out) and share the results with you. I will check the repo daily, but it would be great if you could give me a heads-up when it's out. Thanks, | ||||||
| Comment by Jon Hyman [ 09/Nov/16 ] | ||||||
|
Okay. | ||||||
| Comment by Kaloian Manassiev [ 09/Nov/16 ] | ||||||
|
Hi jon@appboy.com, Yes - based on the symptoms described in this ticket, we expect the fix from SERVER-26859 to address it. The 3.2.11-rc0 release candidate is expected to become available in the coming days. In case you are hitting this problem, would it be possible to give it a try and let us know whether it solves it for you? Thanks in advance. Best regards, | ||||||
| Comment by Jon Hyman [ 09/Nov/16 ] | ||||||
|
Is this fixed now that https://jira.mongodb.org/browse/SERVER-26859 is done? | ||||||
| Comment by Kaloian Manassiev [ 01/Nov/16 ] | ||||||
|
Thanks antogiann for the information. Unfortunately these filtered-out logs do not give us enough information about what is causing the connections to time out, but I am now working on reproducing the problem using your sysbench configuration. Also, a different report with similar symptoms has given us a sense of what might be going on; the problem is described in that ticket. Please watch this ticket for further updates. | ||||||
| Comment by Antonis Giannopoulos [ 25/Oct/16 ] | ||||||
|
Hi Kal, I managed to replicate the error using mongo-sysbench. My setup was six "sbtest" collections with 40 million documents each, sharded on { "_id" : "hashed" }. The sysbench config options may be found attached. I used 9 shards running MMAPv1 3.2.10 and config servers in an SCCC topology. All components (shards, config servers, mongos) were running on CentOS 6. Regards, | ||||||
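For reference, a hypothetical recreation of that collection layout from the mongo shell; the database name "sbtest" and the sbtest1..sbtest6 collection names are assumptions based on typical sysbench naming, while the { "_id" : "hashed" } shard key comes from the description above:

    // Hash-shard six sysbench-style collections on _id.
    sh.enableSharding("sbtest");
    for (var i = 1; i <= 6; i++) {
        sh.shardCollection("sbtest.sbtest" + i, { _id: "hashed" });
    }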
| Comment by Antonis Giannopoulos [ 18/Oct/16 ] | ||||||
|
Hi Kal, Thank you for looking at this. The logs are from one of my mongos, but all four reported the same behavior. Our audit team is examining the logs, and I am confident that I will be able to share the requested information with you.
Are you running your config servers in SCCC mode or in CSRS? – SCCC; it was in our plans to change to CSRS after the 3.2.10 upgrade.
Did you start getting the errors before you upgraded the mongos instances or only after they were upgraded? – The errors started 10 minutes after we upgraded the mongos to 3.2.10. When we reverted to 3.2.8, the errors went away.
Would it be possible to temporarily increase the network logging verbosity on one of the mongos instances where this problem reproduces so we can capture what exact operations were causing this timeout? – I will try to reproduce the issue with "network.verbosity"=2 and provide you the output.
Regards, | ||||||
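A sketch of how that "network.verbosity"=2 setting might be applied and verified on a running mongos, assuming the logComponentVerbosity server parameter and the db.getLogComponents() shell helper:

    // Raise only the network component verbosity to 2 on the mongos.
    db.adminCommand({
        setParameter: 1,
        logComponentVerbosity: { network: { verbosity: 2 } }
    });

    // Confirm the effective setting.
    db.getLogComponents().network.verbosity;   // should now report 2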
| Comment by Kaloian Manassiev [ 17/Oct/16 ] | ||||||
|
Hi Antonis, From the log snippet you attached, it seems that the errors are coming from a mongos instance - is this correct? Would it be possible to attach the logs from one such mongos instance around the time of the errors, and the logs from the mongod instance running at node0-49631 from the same time, so we can see what was happening? I also have a couple of questions:
- Are you running your config servers in SCCC mode or in CSRS?
- Did you start getting the errors before you upgraded the mongos instances or only after they were upgraded?
- Would it be possible to temporarily increase the network logging verbosity on one of the mongos instances where this problem reproduces so we can capture what exact operations were causing this timeout?
Thank you in advance. Best regards, |