[SERVER-37752] possible bug in the server (ignoring timeout, using hardcoded timeout,..) ? Created: 25/Oct/18  Updated: 01/Nov/18  Resolved: 01/Nov/18

Status: Closed
Project: Core Server
Component/s: Internal Code, Networking, Replication, Sharding
Affects Version/s: 3.6.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Michael Adamcik Assignee: Danny Hatcher (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongos.log    
Operating System: ALL
Steps To Reproduce:

I can't say; it's a very big project.

Participants:

 Description   

Hello everyone,

I'm experiencing weird timeouts from MongoDB or the Perl driver, but I suspect it's the server. I have already tried to debug and track down what is happening (I also searched the Perl driver for any occurrence of this timeout).

It seems to be some hardcoded 20-second timeout.

I came to this conclusion through my scripts; they end up like this:

[Thu Oct 25 01:26:57 2018] update_one @ collection foo..
MongoDB::WriteError: Write results unavailable from 192.168.91.187:5000 due to Couldn't get a connection within the time limit
script exit at: Don Okt 25 01:27:18 CEST 2018

The code in question looks exactly like this:

print "[" . scalar localtime . "] collection (foo)..\n";
my $col=$con->get_database("foo")->get_collection("foo");

print "[" . scalar localtime . "] update_one @ collection foo..\n";
$col->update_one({"foo" => foo},{'$addToSet' => { "foo" => foo }}); <= kills the script

So the difference between the update call and the script exit is 20 seconds every time.

It happens not only with $addToSet.

I already connect to the server with very high timeout values (for exactly this debugging purpose):

$con = MongoDB::MongoClient->new(
    host                        => $constr,
    connect_timeout_ms          => 1000 * 60 * 10,        # 10 minutes
    max_time_ms                 => 1000 * 60 * 60 * 3,    # 3 hours (max time for an operation)
    server_selection_timeout_ms => 1000 * 60 * 60 * 1,    # 1 hour
    socket_timeout_ms           => 1000 * 60 * 60 * 4,    # 4 hours (wait for a reply)
    server_selection_try_once   => 0,
    w                           => 1,
    wtimeout                    => 1000 * 60 * 60 * 1,    # 1 hour
    j                           => 0,
);

So the question is: where does this timeout come from, and what can I do, apart from try/catch, to prevent these errors?
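For reference, a rough sketch of what such a try/catch workaround would look like with Try::Tiny and Safe::Isa (the address is the mongos from the error above; the retry count and sleep interval are just placeholders, not values from my setup):

use strict;
use warnings;
use MongoDB;
use Try::Tiny;
use Safe::Isa;    # provides $_isa for safe class checks

# Illustrative only: retry the failing call instead of letting a pooling
# timeout kill the whole script. Retry count and sleep are placeholders.
my $con = MongoDB::MongoClient->new( host => 'mongodb://192.168.91.187:5000' );
my $col = $con->get_database("foo")->get_collection("foo");

my $ok = 0;
for my $attempt ( 1 .. 3 ) {
    try {
        $col->update_one( { "foo" => "foo" }, { '$addToSet' => { "foo" => "foo" } } );
        $ok = 1;
    }
    catch {
        die $_ unless $_->$_isa('MongoDB::Error');    # rethrow anything unexpected
        warn "[" . scalar localtime . "] attempt $attempt failed: $_";
        sleep 5;
    };
    last if $ok;
}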

I'm using MongoDB 3.6.5 and Perl driver 2.0.1 on Debian stable.

Kind regards,
Michael



 Comments   
Comment by Danny Hatcher (Inactive) [ 01/Nov/18 ]

Hello Mike,

I'm glad to hear it. Please note that these timings are longer than what most of our customers use, so you may wish to investigate your infrastructure to determine if there are any upgrades that you can make to improve performance.

As this is not a bug in MongoDB, I will close this SERVER ticket.

Thank you,

Danny

Comment by Michael Adamcik [ 01/Nov/18 ]

Hello Danny,

I'm now testing with a second mongos instance that has adjusted parameters:

shardedConnPoolIdleTimeoutMinutes=10
globalConnPoolIdleTimeoutMinutes=10
ShardingTaskExecutorPoolRefreshTimeoutMS=60000
ShardingTaskExecutorPoolMinSize=4

Seems to solve the issue so far.
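For reference, a sketch of how these values can be carried in a 3.6 mongos YAML config file (only the setParameter section is shown; passing them via --setParameter on the mongos command line is equivalent):

# Sketch of a mongos configuration file excerpt (MongoDB 3.6, YAML format).
# Only the setParameter section is shown; values are the ones listed above.
setParameter:
  shardedConnPoolIdleTimeoutMinutes: 10
  globalConnPoolIdleTimeoutMinutes: 10
  ShardingTaskExecutorPoolRefreshTimeoutMS: 60000
  ShardingTaskExecutorPoolMinSize: 4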

 

Kind regards,

Mike

Comment by Danny Hatcher (Inactive) [ 29/Oct/18 ]

Hello Michael,

Yes, it appears that this is not an issue with the Perl driver and is instead a problem with overloaded `mongod` instances. Since you pointed out in your original description that the timeout was always 20 seconds, I would start with `ShardingTaskExecutorPoolRefreshTimeoutMS`, whose documented default is 20 seconds. If you change that to 30 seconds and then start seeing the errors in your application at a 30-second interval, that should prove it.
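As a sketch, assuming the test mongos is started with --setParameter ShardingTaskExecutorPoolRefreshTimeoutMS=30000, you can confirm the value actually in effect from Perl with a getParameter admin command:

use strict;
use warnings;
use MongoDB;
use Data::Dumper;

# Query the mongos for the pool refresh timeout currently in effect.
# The address below is the mongos from your error message; adjust as needed.
my $con = MongoDB::MongoClient->new( host => 'mongodb://192.168.91.187:5000' );
my $res = $con->get_database('admin')->run_command(
    [ getParameter => 1, ShardingTaskExecutorPoolRefreshTimeoutMS => 1 ]
);
print Dumper($res);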

Can you test that change now?

Thanks,

Danny

Comment by Michael Adamcik [ 27/Oct/18 ]

Hello again,

A little update:

I know my mongod instances experience some heavy-load situations where the storage becomes the bottleneck (high iowait). Maybe in such situations the mongod processes become unable to accept new connections.

So, after seeing the log, I'm now trying some options from https://docs.mongodb.com/v3.6/reference/parameters/#sharding-parameters (connection count, idle time, ...).

If they help, it's not a bug, I guess.

 

Your help and support are still highly appreciated!

Kind regards,

Michael

Comment by Michael Adamcik [ 26/Oct/18 ]

Hello Danny, 

It just happened again; I uploaded the log of the mongos instance. The fun begins at 2018-10-26T20:53:52.

Thank you,

Michael

Comment by Danny Hatcher (Inactive) [ 26/Oct/18 ]

Hello Michael,

Unfortunately, I am unable to reproduce this problem in my own environment. Are you able to reproduce this issue consistently with a test script, or is it something that you only see occasionally as part of your main scripts?

If you are connecting via a mongos node, it is possible that the error is actually at that level and is simply bubbling up to the driver. As you mentioned, I believe the mongos and mongod logs will be the key to resolving this problem. Can you upload a copy of the problem mongos and primary mongod logs if it happens again?

Thank you,

Danny

Comment by Michael Adamcik [ 25/Oct/18 ]

Sorry for the missing logs.

I will add server logging to my mongos and mongod instances; maybe they will give some useful information. It could take a week, as these errors don't appear often.
