[JAVA-452] allow to modify the connection timeout for the maintenance thread Created: 18/Oct/11  Updated: 18/Aug/13  Resolved: 18/Aug/13

Status: Closed
Project: Java Driver
Component/s: Cluster Management
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor - P4
Reporter: Antoine Girbal Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

http://groups.google.com/group/mongodb-user/browse_thread/thread/edc5445822ca5f4c

Right now it is set to 20s and cannot be modified.
This may be too long for large replica sets.
Servers are tested sequentially, and if a few are unreachable this can delay master failover, etc.
It could be set by a Java system property, or computed as min(10s, MongoOptions.timeout).
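
To make the two points above concrete, here is a rough sketch. It assumes the maintenance thread waits out the full connect timeout for each unreachable server, one after another. The class and method names are hypothetical and are not part of the driver.

```java
public class UpdaterTimeoutSketch {
    // Hard-coded connect timeout of the maintenance thread, as reported in this ticket.
    static final int CURRENT_TIMEOUT_MS = 20_000;
    // Upper bound suggested in the description (10s).
    static final int PROPOSED_CAP_MS = 10_000;

    // Worst-case duration of one sequential pass, assuming every unreachable
    // server consumes the full timeout before the next server is tested.
    static long worstCasePassMs(int unreachableServers, int timeoutMs) {
        return (long) unreachableServers * timeoutMs;
    }

    // The min(10s, configured timeout) proposal. A configured value of 0 or
    // less is given no special meaning in this sketch.
    static int proposedTimeoutMs(int configuredTimeoutMs) {
        return Math.min(PROPOSED_CAP_MS, configuredTimeoutMs);
    }

    public static void main(String[] args) {
        // Three unreachable servers with the current 20s timeout.
        System.out.println(worstCasePassMs(3, CURRENT_TIMEOUT_MS));
        // The same three servers with a 5s configured timeout under the proposal.
        System.out.println(worstCasePassMs(3, proposedTimeoutMs(5_000)));
    }
}
```

Under these assumptions, each unreachable server adds one full timeout to the time it takes to notice a change in server status, which is why a shorter or configurable value matters for large replica sets.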



 Comments   
Comment by Jeffrey Yemin [ 18/Aug/13 ]

A system property was added for this a long time ago.

Comment by Jeffrey Yemin [ 10/Mar/12 ]

I examined the source code a bit more closely, and it looks like you can control the timeouts for the replica set maintenance thread through system properties:

_mongoOptionsDefaults.connectTimeout = Integer.parseInt(System.getProperty("com.mongodb.updaterConnectTimeoutMS", "20000"));
_mongoOptionsDefaults.socketTimeout = Integer.parseInt(System.getProperty("com.mongodb.updaterSocketTimeoutMS", "20000"));

This is not the circuit breaker that you described, but it gives you some control at least.
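
As a minimal sketch of how these two properties could be supplied, the example below sets them programmatically and reads them back with the same parsing expression as the lines quoted above. The class name and the helper method are hypothetical. Exactly when the driver reads the properties is an assumption here, so passing them on the command line (for example `-Dcom.mongodb.updaterConnectTimeoutMS=5000`) is likely the safer route.

```java
public class UpdaterTimeoutProperties {
    // Property names taken from the driver source quoted in this comment.
    static final String CONNECT_KEY = "com.mongodb.updaterConnectTimeoutMS";
    static final String SOCKET_KEY = "com.mongodb.updaterSocketTimeoutMS";

    // Mirrors the quoted parsing logic: fall back to 20000 ms when the property is unset.
    static int readTimeoutMs(String key) {
        return Integer.parseInt(System.getProperty(key, "20000"));
    }

    public static void main(String[] args) {
        // Assumption: these must be set before the driver initializes its defaults.
        System.setProperty(CONNECT_KEY, "5000");
        System.setProperty(SOCKET_KEY, "5000");

        System.out.println(readTimeoutMs(CONNECT_KEY));
        System.out.println(readTimeoutMs(SOCKET_KEY));
    }
}
```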

Comment by Antoine Girbal [ 18/Oct/11 ]

The maintenance thread runs in the background, so regular queries do not wait on it unless it is the very first query issued through the driver.
Other queries will use valid master / slaves as detected by the maintenance thread.
So really the issue here is how long it can take to detect changes in server status, if some of the servers are not reachable.

That being said, I would not be surprised if some of our error handling code still tries to update the master itself, which could trigger a thread buildup as you describe.

Comment by Fabio Pugliese Ornellas [ 18/Oct/11 ]

Hello,

The root problem I see is that if you have some MongoDB unavailability (let's say due to a network issue), your queries will go from a few tens of milliseconds (e.g. 20ms) to a few seconds (e.g. a 2s timeout). Since your QPS stays the same (users accessing the site), your concurrent connections will be multiplied by the same factor as the latency (Little's law: concurrency = arrival rate x time in system), in this case by 100x. This will usually break all thread limits, from the load balancer, Apache, Jetty, etc., and make the whole app unavailable. If you use an aggressive timeout, you might drop legitimate requests, since some small percentage of them will be slower during regular operation.

Michael Nygard, in his book "Release It!", describes a way to fix this: implement a circuit breaker. That is, if the external resource is out, you fail fast and return an immediate error. Even during an outage you won't be slower; in fact, you will be faster. This avoids reaching the thread limits.
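
A minimal sketch of the circuit breaker idea follows, assuming a simple consecutive-failure threshold and a fixed cool-down period. It is not part of the driver and is not taken from the book; the class name, constructor parameters, and the injected clock are all hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.function.LongSupplier;

public class SimpleCircuitBreaker {
    private final int failureThreshold;   // consecutive failures before the circuit opens
    private final long openMillis;        // how long to fail fast before allowing a trial call
    private final LongSupplier clock;     // injected time source (milliseconds), eases testing

    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0L;

    public SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    // Runs the action unless the circuit is open, in which case it fails immediately.
    public <T> T call(Callable<T> action) throws Exception {
        if (!allowRequest()) {
            throw new IllegalStateException("circuit open: failing fast");
        }
        try {
            T result = action.call();
            recordSuccess();
            return result;
        } catch (Exception e) {
            recordFailure();
            throw e;
        }
    }

    // Closed circuit: always allow. Open circuit: allow trial calls only after the cool-down.
    // Note: this sketch does not limit how many trial calls pass concurrently.
    private synchronized boolean allowRequest() {
        if (!open) {
            return true;
        }
        return clock.getAsLong() - openedAt >= openMillis;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    // Opens (or re-opens) the circuit once the threshold is reached and restarts the cool-down.
    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = clock.getAsLong();
        }
    }
}
```

In use, an application would create one breaker, for example `new SimpleCircuitBreaker(5, 10_000L, System::currentTimeMillis)`, and wrap each database call in `breaker.call(...)`. While the circuit is open, callers get an immediate exception instead of holding a thread until the timeout expires.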

Implementing a configurable timeout is better than having a 10s default, but it might not hold the app up during an outage.

Cheers.
