[SERVER-26654] ExceededTimeLimit: Operation timed out after upgrade from 3.2.8 to 3.2.10 Created: 17/Oct/16  Updated: 11/Apr/17  Resolved: 11/Nov/16

Status: Closed
Project: Core Server
Component/s: Networking, Sharding
Affects Version/s: 3.2.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Antonis Giannopoulos Assignee: Kaloian Manassiev
Resolution: Duplicate Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive 201.zip     Zip Archive 202.zip     Zip Archive 203.zip     File log-20161012-app0.gz     File log-20161012-node0-49631.gz     HTML File sysbench-config    
Issue Links:
Duplicate
duplicates SERVER-26723 Mongos stalls for even simple queries Closed
Related
is related to SERVER-26859 AsyncResultsMerger replica set retarg... Closed
Operating System: ALL
Sprint: Sharding 2016-11-21
Participants:
Case:

 Description   

We are running our production cluster on version 3.2.8. The cluster consists of four mongos instances, three config servers, and ten shards, each a 3-node replica set; we perform both primary and secondary reads. Very frequently we hit SERVER-26159 and our mongos crashes. According to that bug report the issue was fixed in 3.2.10, so we upgraded on Oct 11. Ten minutes after the upgrade completed we started receiving errors that the mongos could not connect to the shards due to timeouts. The timeouts were very frequent and reduced our processing throughput to 20%. We were forced to roll back to 3.2.8 and the issue went away; we no longer receive any timeouts, but SERVER-26159 has returned.

Oct 11 06:40:03 [NetworkInterfaceASIO-TaskExecutorPool-1-0] Successfully connected to node1-48668:48668
Oct 11 06:40:03 mongos16026 mongos.16026[27341]: [NetworkInterfaceASIO-TaskExecutorPool-1-0] Successfully connected to node0-49179:49179
Oct 11 06:40:03 mongos16026 mongos.16026[27341]: [NetworkInterfaceASIO-TaskExecutorPool-4-0] Connecting to node0-49627:49627
Oct 11 06:40:03 mongos16026 mongos.16026[27341]: [NetworkInterfaceASIO-TaskExecutorPool-1-0] Failed to connect to node0-49630:49630 - ExceededTimeLimit: Operation timed out
Oct 11 06:40:03 mongos16026 mongos.16026[27341]: [NetworkInterfaceASIO-TaskExecutorPool-4-0] Connecting to node1-49631:49631
Oct 11 06:40:03 mongos16026 mongos.16026[27341]: [NetworkInterfaceASIO-TaskExecutorPool-4-0] Connecting to node0-48668:48668
Oct 11 06:40:03 mongos16026 mongos.16026[27341]: [NetworkInterfaceASIO-TaskExecutorPool-4-0] Connecting to node0-48574:48574
Oct 11 06:40:03 mongos16026 mongos.16026[27341]: [NetworkInterfaceASIO-TaskExecutorPool-1-0] Failed to connect to node0-49631:49631 - ExceededTimeLimit: Operation timed out
Oct 11 06:40:03 mongos16026 mongos.16026[27341]: [NetworkInterfaceASIO-TaskExecutorPool-1-0] Failed to connect to node0-49631:49631 - ExceededTimeLimit: Operation timed out
Oct 11 06:40:03 mongos16026 mongos.16026[27341]: [NetworkInterfaceASIO-TaskExecutorPool-1-0] Failed to connect to node1-48574:48574 - ExceededTimeLimit: Operation timed out

Is this a known bug in 3.2.10 and is there any workaround?



 Comments   
Comment by Alessandro Gherardi [ 12/Nov/16 ]

Here are the log files, broken down per server.

The environment consists of 3 Windows 2012 servers. There is a single replica set with 3 mongod's, one on each server, and a CSRS with 3 config servers, one on each server. There is a mongos on each server. The IPs are 192.168.20.201 (mongod0-rs1-GUID and configsvr1-GUID), 192.168.20.202 (mongod2-rs1-GUID and configsvr2-GUID), and 192.168.20.203 (mongod1-rs1-GUID and configsvr3-GUID).

Notice that I rebooted the servers prior to running the test - this is a test environment - so connection failures earlier in the logs are expected.

Comment by Kaloian Manassiev [ 11/Nov/16 ]

antogiann, I am closing this ticket and moving the investigation to SERVER-26723.

Comment by Kaloian Manassiev [ 10/Nov/16 ]

Yes, I meant the mongod node to which mongos failed to connect. For example, this would be node0-49630:49630 if this were the error message:

Failed to connect to node0-49630:49630 - ExceededTimeLimit: Operation timed out

-Kal.

Comment by Alessandro Gherardi [ 10/Nov/16 ]

Kal,

>>>> the log from the host against which the call was made?
You mean the mongoD log?

>>>> would it be possible that you try to use 3.2.11-rc0 again
This weekend, I should have access to the test environment where we noticed this issue. I'll retry 3.2.11-rc0 and attach the info you requested to this ticket.

Thanks,
Alessandro

Comment by Kaloian Manassiev [ 10/Nov/16 ]

Thank you antogiann and agherardi for trying out 3.2.11-rc0. It is unfortunate that our hypothesis was not correct and that we have been unable to reproduce the issue while running sysbench in AWS as antogiann suggested.

However, in 3.2.11-rc0 we improved the logging around network timeouts, which should be useful in tracking this down.

agherardi, would it be possible for you to try 3.2.11-rc0 again, increase the log verbosity for the command, network and asio components using the calls below until the issue reproduces, and then include the mongos log and the log from the host against which the call was made?

These are the commands to increase the log verbosity:

db.adminCommand({ setParameter: 1, logComponentVerbosity: { command: { verbosity: 1 } } });
db.adminCommand({ setParameter: 1, logComponentVerbosity: { network: 1 } });
db.adminCommand({ setParameter: 1, logComponentVerbosity: { network: { asio: { verbosity: 2 } } } });

And these are the commands to bring it back to zero:

db.adminCommand({ setParameter: 1, logComponentVerbosity: { command: { verbosity: 0 } } });
db.adminCommand({ setParameter: 1, logComponentVerbosity: { network: 0 } });
db.adminCommand({ setParameter: 1, logComponentVerbosity: { network: { asio: { verbosity: 0 } } } });
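
If useful, the current settings can be read back as a sanity check (getParameter is standard; this is just an optional convenience, not an extra step we require):

db.adminCommand({ getParameter: 1, logComponentVerbosity: 1 });   // prints the effective per-component verbosity levels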

Thank you very much in advance for your help.

-Kal.

Comment by Antonis Giannopoulos [ 10/Nov/16 ]

Hi Kal,

I tested using the same methodology against 3.2.11-rc0 a few minutes ago and the issue is still there.

Thanks,
Antonis

Comment by Roy Reznik [ 10/Nov/16 ]

We have the same issue with 3.2.10.

Comment by Alessandro Gherardi [ 09/Nov/16 ]

We have the same problem. We just tried 3.2.11-rc0, same issue.

Comment by Antonis Giannopoulos [ 09/Nov/16 ]

Thanks Kal,

I will try the mongo-sysbench configuration that reproduced the issue in 3.2.10 against 3.2.11-rc0 (when it's out) and share the results with you.

I will check the repo daily, but it would be great if you could give me a heads-up when it's out.

Thanks,
Antonis

Comment by Jon Hyman [ 09/Nov/16 ]

Okay.

Comment by Kaloian Manassiev [ 09/Nov/16 ]

Hi jon@appboy.com,

Yes - based on the symptoms described in this ticket we expect that SERVER-26859 should fix them.

The 3.2.11-rc0 release candidate is expected to become available in the coming days. In case you are hitting this problem, would it be possible to give it a try and let us know whether it solves it for you?

Thanks in advance.

Best regards,
-Kal.

Comment by Jon Hyman [ 09/Nov/16 ]

Is this fixed now that https://jira.mongodb.org/browse/SERVER-26859 is done?

Comment by Kaloian Manassiev [ 01/Nov/16 ]

Thanks, antogiann, for the information. Unfortunately these filtered-out logs do not give us enough information about what is causing the connections to time out, but I am now working on reproducing the problem using your sysbench configuration.

Also, from a different report with similar symptoms we now have a sense of what might be going on. The problem is described in SERVER-26859 and we are now working on getting it fixed and ported to 3.2.

Please watch this ticket for further updates.

Comment by Antonis Giannopoulos [ 25/Oct/16 ]

Hi Kal,

I managed to replicate the error using mongo-sysbench. My setup was six "sbtest" collections with 40 million documents each, sharded on { "_id" : "hashed" }. The sysbench config options can be found attached. I used 9 shards running MMAPv1 3.2.10 and config servers in an SCCC topology. All components (shards, config servers, mongos) were running CentOS 6.
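
For reference, a minimal sketch of the sharding setup, run from a mongos shell; the database and collection names below are illustrative assumptions, not the exact mongo-sysbench names:

sh.enableSharding("sbtest");                                          // enable sharding on the (assumed) database
db.getSiblingDB("sbtest").sbtest1.createIndex({ "_id": "hashed" });   // hashed index is required first if the collection is already populated
sh.shardCollection("sbtest.sbtest1", { "_id": "hashed" });            // hashed shard key on _id, repeated per collection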

Regards,
Antonis

Comment by Antonis Giannopoulos [ 18/Oct/16 ]

Hi Kal,

Thank you for looking at this.

The logs are from one of my mongos instances, but all four reported the same behavior. Our audit team is examining the logs, and I am confident that I will be able to share the requested information with you.

Are you running your config servers in SCCC mode or in CSRS? – SCCC; we planned to change to CSRS after the 3.2.10 upgrade.

Did you start getting the errors before you upgraded the mongos instances or only after they were upgraded? – The errors started 10 minutes after we upgraded the mongos to 3.2.10. When we reverted to 3.2.8, the errors went away.

Would it be possible to temporarily increase the network logging verbosity on one of the mongos instances where this problem reproduces so we can capture what exact operations were causing this timeout? – I will try to reproduce the issue with "network.verbosity"=2 and provide you with the output.

Regards,
Antonis

Comment by Kaloian Manassiev [ 17/Oct/16 ]

Hi Antonis,

From the log snippet you have attached, it seems that the errors are coming from a mongos instance - is this correct?

Would it be possible to attach the logs from one such mongos instance around the time of the errors and the logs from the mongod instance running at node0-49631 from the same time so we can see what was happening?

I also have a couple of questions:

  • Are you running your config servers in SCCC mode or in CSRS?
  • Did you start getting the errors before you upgraded the mongos instances or only after they were upgraded? I.e., were the errors happening when the shards were running 3.2.10 and mongos at 3.2.8?
  • Would it be possible to temporarily increase the network logging verbosity on one of the mongos instances where this problem reproduces so we can capture what exact operations were causing this timeout?
    The command to do this from the shell is:

    db.setLogLevel(2, 'systemLog.component.network.verbosity');
    

    after a few errors have been collected, you can turn the verbosity back to 0 using the following command:

    db.setLogLevel(0, 'systemLog.component.network.verbosity');
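
    For completeness, the same network verbosity can also be set and reset via setParameter with logComponentVerbosity (the form used in the 10/Nov/16 comment above), for example:

    db.adminCommand({ setParameter: 1, logComponentVerbosity: { network: { verbosity: 2 } } });   // raise network verbosity
    db.adminCommand({ setParameter: 1, logComponentVerbosity: { network: { verbosity: 0 } } });   // restore the default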
    

Thank you in advance.

Best regards,
-Kal.
