[JAVA-4288] Allow configuration of MAX_CONNECTING on the connection pool after 4.3.x changes Created: 01/Sep/21  Updated: 04/May/22  Resolved: 15/Oct/21

Status: Closed
Project: Java Driver
Component/s: Connection Management
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Minor - P4
Reporter: Marc Bridner Assignee: Valentin Kavalenka
Resolution: Duplicate Votes: 0
Labels: external-user
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by DRIVERS-1943 Make maxConnecting configurable Implementing
Related
related to JAVA-4316 The background thread should be able ... Closed
is related to JAVA-4390 Make maxConnecting configurable Closed
Documentation Changes: Not Needed

 Description   

Hi

After migrating from 4.2.1 to 4.3.1, which includes the connection pool changes made in JAVA-3927, we find that allowing only 2 new connections to be established at a time is too conservative, and we would like that value to be configurable.

Our setup used to be (we have 60 worker threads in our app):

.minSize(60)
.maxSize(120) 
.maxConnectionLifeTime(60, TimeUnit.SECONDS)
.maxWaitTime(0, TimeUnit.MILLISECONDS)

Due to external factors we need to limit our connection lifetime to 60 seconds. We also have high throughput on our connections and need to limit p99 latency spikes, which is why we've instructed the driver to time out rather than wait the default 2 seconds.
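
For reference, a minimal sketch of how pool settings like these are applied through the client settings builder (the connection string and class name are illustrative, not our actual setup):

import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

import java.util.concurrent.TimeUnit;

public final class PoolSettingsExample {
    public static void main(String[] args) {
        MongoClientSettings settings = MongoClientSettings.builder()
                // Placeholder connection string, not our real topology.
                .applyConnectionString(new ConnectionString("mongodb://db.example.com:27017"))
                .applyToConnectionPoolSettings(builder -> builder
                        // The pool configuration quoted above.
                        .minSize(60)
                        .maxSize(120)
                        .maxConnectionLifeTime(60, TimeUnit.SECONDS)
                        .maxWaitTime(0, TimeUnit.MILLISECONDS))
                .build();

        MongoClient client = MongoClients.create(settings);
    }
}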

After upgrading to 4.3.1 we started encountering a LOT of timeouts and changed the settings to:

.minSize(90)
.maxSize(120) 
.maxConnectionLifeTime(60, TimeUnit.SECONDS)
.maxWaitTime(50, TimeUnit.MILLISECONDS)
.maintenanceFrequency(250, TimeUnit.MILLISECONDS)

(changed max wait to 50 ms, increased the min size, and reduced the maintenanceFrequency interval to 250 ms)

This takes care of most of our time-outs, but we still get the occasional one. It has increased our p99/p100 latency by 50 ms, and we're essentially wasting 30 threads per host connecting to the DB.

It'd be most appreciated if we could configure MAX_CONNECTING instead of having it as a static final package-private constant.
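
To illustrate, an option along these lines on the pool settings builder (hypothetical here; JAVA-4390, linked above, tracks making maxConnecting configurable) could look roughly like:

// Hypothetical sketch of the requested option; JAVA-4390 (linked above) tracks
// making maxConnecting configurable across drivers.
MongoClientSettings.builder()
        .applyToConnectionPoolSettings(builder -> builder
                .maxSize(120)
                // Would replace the fixed MAX_CONNECTING = 2 with a configurable limit.
                .maxConnecting(8))
        .build();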



 Comments   
Comment by Valentin Kavalenka [ 15/Oct/21 ]

Hi bridner.marc@gmail.com,

Thank you for all the information. The performance bug (JAVA-4316) that I identified while thinking about your report is now fixed and will be released in 4.3.4, the next patch release.

We received a request similar to yours for a different driver (CSHARP-3885), but unlike your scenario, it does not seem that the scenario there can be mitigated by any approach other than allowing manual control of the max connecting limit or making it adaptive. The former approach has a chance to be agreed upon, specified, and implemented in all drivers that implement the max connecting limit. I am closing this ticket and suggest you follow the progress first via DRIVERS-1943 and then via the JAVA ticket, which will be split off from DRIVERS-1943 once it is decided that the change should be implemented in the drivers.

Comment by Marc Bridner [ 28/Sep/21 ]

Hi

Yes, we do storm the server somewhat, but ideally the storm spreads out over the full 60 seconds; we don't refresh all connections at the same instant. We never ran into any starvation issues before the max connecting limit was introduced. We only have 6 hosts, though; if we had 600 hosts it would be an entirely different story.

More eager population of the connection pool would be ideal, if there's a guarantee that the minimum is actually maintained (except for network issues). Re-creating a connection on termination check-in would work too, if that was done on a thread other than the ones serving requests. Avoiding any connection management on request threads is important in order to have stable and predictable latencies.

Our workload is all .find() operations that complete in p50 < 3 ms / p99 < 5 ms (at 12,000 TPS), so I assume the duration for which connections are checked out is p99 < 5 ms as well. I unfortunately do not have any reliable metrics on how long connection setup takes. Each of our servers runs 60 worker threads, and during load tests they're 95% occupied. I don't have any more detailed distribution metrics for concurrent check-outs.

The 60-second limit is artificial and stems from our internal company network structure: the IP we connect to belongs to an AWS Network Load Balancer fronting a single host (the primary database instance), and that NLB has a hard limit of 350 seconds per connection. We elected to limit our connection lifetime to 60 seconds; the same problem would arise regardless of what value we pick. Having a load balancer with one host doesn't make much sense, but we have to use AWS VPC Endpoints to connect to this NLB from our own VPC due to company InfoSec policies as well as topology choices - we have no choice in this.

 

Comment by Valentin Kavalenka [ 27/Sep/21 ]

Hi bridner.marc@gmail.com,

Thank you for reporting the issue you encountered. Making the max connecting limit configurable essentially allows disabling the limit, changing the driver behavior back to when it could storm servers with new connection requests without being throttled. We would prefer to find other ways of improving latencies in situations like yours. You mitigated the problem to an extent by changing the driver configuration. One additional idea is expressed in JAVA-4316; another could be to populate the connection pool based on minSize more eagerly: either in anticipation that a connection will soon exceed its maxConnectionLifeTime (this can be done if maxSize - totalConnectionCount > 0), or right away when the pool terminates a connection on check-in.

Since it is not obvious which idea, or combination of ideas, for mitigating p99 latency may help and to what extent, we would like to know more about how you use the driver (note that this does not imply a commitment to implement any such idea). The most useful things to know are probably:

  • The distribution of the durations during which connections remain checked out (these durations do not include the check-out time itself). If this is impossible to provide, perhaps you could at least tell us what kinds of operations are executed via the Java driver.
  • The distribution of the durations of time it takes to establish a new connection when checking out.
  • The distribution of the number of concurrent checkouts per pool.

Of course, we are not asking for the actual probability distribution, but rather some information (different percentiles, mean/median/mode, etc.) that sheds light on the timings and the level of concurrency in your scenario.
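
For example, one rough way to collect the first two of these, assuming a custom listener can be registered with the client (class and method names below are illustrative, and recording is left as a placeholder), is the driver's ConnectionPoolListener:

import com.mongodb.connection.ConnectionId;
import com.mongodb.event.ConnectionCheckedInEvent;
import com.mongodb.event.ConnectionCheckedOutEvent;
import com.mongodb.event.ConnectionCreatedEvent;
import com.mongodb.event.ConnectionPoolListener;
import com.mongodb.event.ConnectionReadyEvent;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Records how long each connection stays checked out and how long connection
// setup takes; the recorded durations can be fed into any metrics/histogram library.
public final class PoolTimingListener implements ConnectionPoolListener {
    private final Map<ConnectionId, Long> checkedOutAt = new ConcurrentHashMap<>();
    private final Map<ConnectionId, Long> createdAt = new ConcurrentHashMap<>();

    @Override
    public void connectionCheckedOut(ConnectionCheckedOutEvent event) {
        checkedOutAt.put(event.getConnectionId(), System.nanoTime());
    }

    @Override
    public void connectionCheckedIn(ConnectionCheckedInEvent event) {
        Long start = checkedOutAt.remove(event.getConnectionId());
        if (start != null) {
            recordCheckedOutDuration(System.nanoTime() - start);
        }
    }

    @Override
    public void connectionCreated(ConnectionCreatedEvent event) {
        createdAt.put(event.getConnectionId(), System.nanoTime());
    }

    @Override
    public void connectionReady(ConnectionReadyEvent event) {
        Long start = createdAt.remove(event.getConnectionId());
        if (start != null) {
            recordSetupDuration(System.nanoTime() - start);
        }
    }

    private void recordCheckedOutDuration(long nanos) {
        // Placeholder: publish to whatever metrics backend is in use.
    }

    private void recordSetupDuration(long nanos) {
        // Placeholder: publish to whatever metrics backend is in use.
    }
}

Such a listener can be registered via applyToConnectionPoolSettings(b -> b.addConnectionPoolListener(...)) on the MongoClientSettings builder.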

Another question I have is about limiting the connection lifetime to 60 seconds. What causes this? Effectively, your application regularly storms the servers with requests to establish new connections, which is a surprising approach per se, and even more so taking into account your sensitivity to latencies.
