[JAVA-4288] Allow configuration of MAX_CONNECTING on the connection pool after 4.3.x changes Created: 01/Sep/21 Updated: 04/May/22 Resolved: 15/Oct/21 |
| Status: | Closed |
| Project: | Java Driver |
| Component/s: | Connection Management |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Minor - P4 |
| Reporter: | Marc Bridner | Assignee: | Valentin Kavalenka |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | external-user |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Documentation Changes: | Not Needed |
| Description |
Hi

After migrating from 4.2.1 to 4.3.1, which includes the connection pool changes, our setup used to be (we have 60 worker threads in our app):

Due to external factors we need to limit our connection lifetime to 60 seconds, we have high throughput on our connections, and we need to limit p99 latency spikes, which is why we've instructed the driver to time out rather than wait the default 2 seconds. After upgrading to 4.3.1 we started encountering a LOT of timeouts and changed the settings to:

(changed max wait to 50 ms, increased min-size and lowered maintenance frequency)

This takes care of MOST of our time-outs, but we still get the occasional one. It has increased our p99/p100 latency by 50 ms, and we're essentially wasting 30 threads per host connecting to the DB. It'd be most appreciated if we could configure MAX_CONNECTING instead of having it as a static final package-private constant.
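A minimal sketch of how pool options like these are expressed with the 4.x Java driver's MongoClientSettings; the 50 ms max wait and 60 s connection lifetime come from this report, while the connection string, pool sizes and maintenance interval below are illustrative placeholders rather than the reporter's actual values.

```java
import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

import java.util.concurrent.TimeUnit;

public final class PoolSettingsSketch {
    public static MongoClient create() {
        MongoClientSettings settings = MongoClientSettings.builder()
                // hypothetical connection string, not the reporter's real endpoint
                .applyConnectionString(new ConnectionString("mongodb://db.example.com:27017"))
                .applyToConnectionPoolSettings(builder -> builder
                        // pool sizes are illustrative; the app runs 60 worker threads per host
                        .maxSize(60)
                        .minSize(60)
                        // fail fast instead of queueing: 50 ms max wait (from the report)
                        .maxWaitTime(50, TimeUnit.MILLISECONDS)
                        // external constraint: recycle connections after 60 seconds (from the report)
                        .maxConnectionLifeTime(60, TimeUnit.SECONDS)
                        // the maintenance interval was also tuned; 5 seconds is an illustrative value
                        .maintenanceFrequency(5, TimeUnit.SECONDS))
                .build();
        return MongoClients.create(settings);
    }
}
```

The setting the ticket asks for is the one missing from this builder: in 4.3.x, MAX_CONNECTING (the number of connections a pool may establish concurrently) is a package-private constant and cannot be tuned here.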
| Comments |
| Comment by Valentin Kavalenka [ 15/Oct/21 ] |
Thank you for all the information.

The performance bug (

We got a request similar to yours for a different driver (
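A minimal sketch of what the requested knob looks like, assuming a driver version that exposes it on ConnectionPoolSettings.Builder (later 4.x releases do); the value 8 is purely illustrative.

```java
import com.mongodb.MongoClientSettings;

public final class MaxConnectingSketch {
    public static MongoClientSettings settings() {
        return MongoClientSettings.builder()
                .applyToConnectionPoolSettings(builder -> builder
                        // allow more concurrent connection establishments than the former
                        // hard-coded limit of 2; 8 is an arbitrary illustrative value
                        .maxConnecting(8))
                .build();
    }
}
```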
| Comment by Marc Bridner [ 28/Sep/21 ] |
Hi

Yes, we do storm the server somewhat, but ideally the storm should spread out over all 60 seconds; we don't refresh ALL connections in the same instant. We never ran into any starvation issues before the max connecting limit. We only have 6 hosts, though; if we had 600 hosts it'd be an entirely different story.

More eager population of the connection pool would be ideal, if there's a guarantee that the minimum is actually maintained (except for network issues). Re-creating a connection when one is terminated at check-in would work too, if that were done on a thread other than the ones serving requests. Avoiding any connection management on request threads is important in order to have stable and predictable latencies.

Our workload is all .find()s that complete in p50 < 3 ms and p99 < 5 ms (at 12000 TPS), so I assume the duration for which connections are checked out is p99 < 5 ms as well. I unfortunately do not have any reliable metrics on how long connection setup takes. Each of our servers runs 60 worker threads, and during load tests they're 95% occupied. I don't have any more detailed distribution metrics for concurrent check-outs.

The 60 second limit is artificial due to internal company network structure: the IP we connect to is that of an AWS Network Load Balancer with 1 host (the primary database instance), which has a hard limit of 350 seconds per connection. We elected to limit our connections to 60 seconds; it'll result in the same problem regardless of what value we pick. Having a load balancer with 1 host doesn't make much sense, but we have to use AWS VPC Endpoints to connect to this NLB from our own VPC due to company InfoSec policies as well as topology choices - we have no choice in this.
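A sketch of how the missing numbers (connection setup time and the level of concurrent check-outs) could be collected with the driver's ConnectionPoolListener; the listener below and its crude System.out reporting are illustrative assumptions, not something from this ticket.

```java
import com.mongodb.MongoClientSettings;
import com.mongodb.event.ConnectionCheckedInEvent;
import com.mongodb.event.ConnectionCheckedOutEvent;
import com.mongodb.event.ConnectionCreatedEvent;
import com.mongodb.event.ConnectionPoolListener;
import com.mongodb.event.ConnectionReadyEvent;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public final class PoolMetricsListener implements ConnectionPoolListener {
    private final Map<String, Long> createdAtNanos = new ConcurrentHashMap<>();
    private final AtomicInteger checkedOut = new AtomicInteger();

    @Override
    public void connectionCreated(ConnectionCreatedEvent event) {
        // connection establishment starts here...
        createdAtNanos.put(event.getConnectionId().toString(), System.nanoTime());
    }

    @Override
    public void connectionReady(ConnectionReadyEvent event) {
        // ...and the connection becomes usable here
        Long start = createdAtNanos.remove(event.getConnectionId().toString());
        if (start != null) {
            long setupMillis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("connection setup took " + setupMillis + " ms");
        }
    }

    @Override
    public void connectionCheckedOut(ConnectionCheckedOutEvent event) {
        System.out.println("concurrent check-outs: " + checkedOut.incrementAndGet());
    }

    @Override
    public void connectionCheckedIn(ConnectionCheckedInEvent event) {
        checkedOut.decrementAndGet();
    }

    // convenience hook for wiring the listener into the client settings
    public static MongoClientSettings.Builder register(MongoClientSettings.Builder builder) {
        PoolMetricsListener listener = new PoolMetricsListener();
        return builder.applyToConnectionPoolSettings(pool -> pool.addConnectionPoolListener(listener));
    }
}
```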
| Comment by Valentin Kavalenka [ 27/Sep/21 ] |
Thank you for reporting the issue you encountered. Allowing the max connecting limit to be configured essentially allows disabling the limit, changing the driver behavior back to when it could storm servers with new connection requests without being throttled. We would prefer to find other ways of improving latencies in situations like yours. You mitigated the problem to an extent by changing the driver configuration. One of the additional ideas is expressed in

Since it is not obvious which idea, or combination of ideas, for mitigating p99 latency may help and to what extent, we would like to know more about your scenario of using the driver (note that this does not imply a commitment to implement any such idea). The most useful things to know are probably

Of course, we are not asking for the actual probability distribution, but rather some information (different percentiles, mean/median/mode, etc.) that sheds light on the timings and the level of concurrency in your scenario. Another question I have is about limiting the connection lifetime to 60 seconds. What causes this? Effectively, your application regularly storms the servers with requests to establish new connections, which is a surprising approach per se, and even more so taking into account your sensitivity to latencies.