[JAVA-4452] Improve performance characteristics of the connection pool to be on par with the driver 4.2.0 Created: 25/Jan/22 Updated: 28/Oct/23 Resolved: 27/Jan/22
| Status: | Closed |
| Project: | Java Driver |
| Component/s: | Performance |
| Affects Version/s: | None |
| Fix Version/s: | 4.5.0 |
| Type: | Improvement | Priority: | Critical - P2 |
| Reporter: | Valentin Kavalenka | Assignee: | Valentin Kavalenka |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Documentation Changes: | Not Needed |
| Description |
We observed a significant regression in the ycsb_100read benchmark (see ycsb-mongodb/workloads/workloadc and dsi/configurations/test_control/test_control.ycsb.yml). To give some perspective, when the ycsb_100read benchmark is run against a single-node MongoDB replica set, the throughput dropped from 55k ops/s for 4.2.0, to 36k for 4.3.0, to 19k for 4.4.0.
| Comments |
| Comment by Githook User [ 27/Jan/22 ] |
Author: {'name': 'Valentin Kovalenko', 'email': 'valentin.male.kovalenko@gmail.com', 'username': 'stIncMale'}

Message: Alternate fair/unfair lock modes in the connection pool to improve throughput (#862)

Another tiny change is moving `pool.release(openConnection)` out of the guarded section.
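For context, the second change mentioned in the message can be illustrated with a minimal sketch (the Pool and Connection types and the method below are hypothetical, not the driver's actual code): the connection is handed back to the pool only after the lock guarding the shared state has been released, so threads contending for that lock are not delayed by the release work.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not the driver's actual code: hand the connection back
// to the pool only after leaving the guarded section.
final class ReleaseOutsideGuardedSection {
    interface Connection {}
    interface Pool { void release(Connection connection); }

    private final ReentrantLock lock = new ReentrantLock();

    void finishWithConnection(Pool pool, Connection openConnection) {
        lock.lock();
        try {
            // mutate only the state guarded by the lock here
        } finally {
            lock.unlock();
        }
        // pool.release(openConnection) no longer runs while the lock is held
        pool.release(openConnection);
    }
}
```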
| Comment by Valentin Kavalenka [ 26/Jan/22 ] |
**The cause**

The degradation is caused by fair locking introduced at first by
I compared only the snapshot of 4.5.0 (specifically, c52ff80) with 4.2.0, disregarding 4.3.0 and 4.4.0, because the same regression is observable with the snapshot of 4.5.0.

**The baseline**

The cause is clear; now we need to gather specific characteristics of 4.2.0 to treat them as the baseline when judging the fix. Having the ycsb_100read benchmark is good, but I also wanted a narrower benchmark that measures the performance characteristics of a connection cycle (check out / check in). For this purpose I introduced the connectionCycle benchmark, written with JMH. When viewing the results, one should keep in mind the following
Some of the baseline metric values deserve explanations.

**The baseline connectionCycle throughput for (1000 threads, 250 connections) is interesting**

34 ops/ms is 34_000 ops/s, i.e., just checking a connection out and checking it back in has a smaller throughput than the ycsb_100read throughput for 1000 threads, 1000 connections, which is 45_000 ops/s. As I mentioned, we can't directly compare the results because they were obtained on different computers, but we can definitely tell that the throughput measured by connectionCycle for this scenario is very low. I am sure ycsb_100read would show this for 4.2.0 if run with fewer connections than the number of workload threads. The reason is that 4.2.0 uses a fair semaphore to enforce the pool's maxSize. When there are enough available permits for all threads, threads do not queue waiting for an available permit, and the fairness of the semaphore plays no role. However, when there are not enough available permits, which is definitely the case for (1000 threads, 250 connections), threads are not only queued waiting for a permit, they all have to go to the end of the queue and wait their turn because the semaphore is fair. Due to this behavior, the throughput in connectionCycle drops by an order of magnitude. However, the latencies are all very similar, ranging from 28 to 43 ms.

**The baseline connectionCycle high percentiles of latencies for (1000 threads, 1000 connections) are interesting**

p0.9999 is almost 1 s, p1 is about 2 s. Remember, the benchmark just checks out and checks in connections, not doing anything else, and yet sometimes it takes 2 seconds. It is surprising at first because the semaphore used is fair. However, as explained above, threads are not blocked by this semaphore because there are enough available permits for them, which means that the semaphore's fairness does not play a role. Instead, given that 1000 threads is way more than the number of threads the CPU can run in parallel, we likely observe here the unfairness of the OS thread scheduler. Thanks, jeff.yemin, for the idea of this explanation (my initial explanation was proven to be incorrect).

**Going deeper into the cause**

We saw that making blocking behavior unfair in 4.5.0 solves the regression, but why is that the case, given that the connection pool in 4.2.0 also blocks threads in a fair manner by using a fair semaphore?

**Implementing each of the aforementioned tasks**

It is now clear that in ycsb_100read (1000 threads, 1000 connections) workload threads are not blocked in 4.2.0, but are blocked in 4.5.0 (even though no thread waits for an available permit because there are enough of them), and blocked in a fair manner, which further reduces the throughput measured by the benchmark.
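To make the semaphore discussion concrete, here is a minimal sketch, assuming a hypothetical MaxSizeLimiter class (this is not the driver's code), of how a fair java.util.concurrent.Semaphore can enforce maxSize. The "4.2.0-unfair" variant measured in the table below presumably corresponds to constructing such a semaphore with fair = false.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Minimal sketch of the 4.2.0-style approach described above (illustrative, not the driver's code):
// a fair Semaphore caps the number of concurrently checked-out connections at maxSize.
final class MaxSizeLimiter {
    private final Semaphore permits;

    MaxSizeLimiter(int maxSize) {
        // fair = true: when threads do queue, permits are granted in FIFO order
        this.permits = new Semaphore(maxSize, true);
    }

    void checkOut(long timeout, TimeUnit unit) throws InterruptedException, TimeoutException {
        // With enough permits for every thread, e.g. (1000 threads, 1000 connections),
        // nobody queues and fairness plays no role. With fewer permits than threads,
        // e.g. (1000 threads, 250 connections), each thread joins the FIFO queue.
        if (!permits.tryAcquire(timeout, unit)) {
            throw new TimeoutException("timed out waiting for a connection permit");
        }
    }

    void checkIn() {
        permits.release();
    }
}
```

A connectionCycle iteration then amounts to a checkOut immediately followed by a checkIn.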
**connectionCycle, 4.2.0-unfair**

| metric \ workload threads, max pool size | 1000, 1000 | 6, 6 | 1, 1 | 1000, 250 |
|---|---|---|---|---|
| throughput, ops/ms | 2673.850 ± 166.890 | 2729.713 ± 60.227 | 5130.801 ± 116.675 | 2754.849 ± 172.705 |
| avg latency, ms | 0.358 ± 0.004 | - | - | 0.383 ± 0.003 |
| p0 (min) latency, ms | ≈ 10⁻⁴ | - | - | ≈ 10⁻⁵ |
| p0.5 latency, ms | 0.004 | - | - | 0.004 |
| p0.9 latency, ms | 0.008 | - | - | 0.007 |
| p0.95 latency, ms | 0.009 | - | - | 0.009 |
| p0.99 latency, ms | 0.014 | - | - | 0.014 |
| p0.999 latency, ms | 0.027 | - | - | 73.138 |
| p0.9999 latency, ms | 917.504 | - | - | 590.348 |
| p1 (max) latency, ms | 2048.918 | - | - | 2122.318 |
We can see that unfair locking, which comes into play for (1000 threads, 250 connections), causes quite significant latencies at high percentiles. This seems to be evidence in support of the connection pool specification's fairness requirement.
What if we had fair locking behavior only when there are not enough available permits, and unfair locking behavior when there are enough of them? This is the behavior we had in 4.2.0 with the semaphore, if we consider only fairness.
The fairness of Java SE locks is fixed at construction and cannot be changed later. However, ReentrantReadWriteLock.WriteLock.tryLock disregards fairness and allows a thread to acquire the lock while ignoring all the queued threads, provided that the lock is not held by another thread. Calling tryLock, and then lock if the former did not succeed, emulates an unfair lock to an extent.
All we need now is to know whether there are waiting threads, in order to decide if fair locking is needed. ReentrantReadWriteLock.hasWaiters requires the calling thread to hold the lock, but we need to know whether there are waiters before acquiring the lock. Well, we can estimate this by counting waiters² on our own.
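Below is one way to combine these two ideas, sketched with assumed names (this is not the driver's actual implementation): barge with tryLock while the waiter estimate is zero, and fall back to the fair lock() otherwise.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch (illustrative, not the driver's code): alternate between unfair and fair
// acquisition of a fair ReentrantReadWriteLock.WriteLock based on an estimated waiter count.
final class MostlyUnfairLock {
    private final Lock lock = new ReentrantReadWriteLock(true).writeLock();
    // our own estimate of how many threads are currently waiting for the lock,
    // maintained because hasWaiters cannot be consulted before acquiring the lock
    private final AtomicInteger waitersEstimate = new AtomicInteger();

    void lock() {
        // tryLock barges past queued threads, so it is attempted only when the
        // estimate says nobody is waiting
        if (waitersEstimate.get() > 0 || !lock.tryLock()) {
            waitersEstimate.incrementAndGet();
            try {
                lock.lock(); // fair acquisition: queued threads are served in FIFO order
            } finally {
                waitersEstimate.decrementAndGet();
            }
        }
    }

    void unlock() {
        lock.unlock();
    }
}
```

The count is only an estimate: a thread can observe zero waiters just before another thread starts waiting and barge ahead of it, which is the compromise reflected in the 4.5.0-snapshot-proposed numbers below.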
Here is what we get as a result of applying this approach:
**ycsb_100read, 4.5.0-snapshot-proposed**

| metric \ workload threads | 1000 | 100 | 6 | 1 |
|---|---|---|---|---|
| throughput, ops/s | 47704 | 57610 | 16358 | 2887 |
| avg latency, μs | 20805 | 1709 | 362 | 341 |
| p0 (min) latency, μs | 257 | 276 | 289 | 304 |
| p0.95 latency, μs | - | 1960 | 400 | 380 |
| p0.99 latency, μs | - | 3000 | 440 | 480 |
| p1 (max) latency, μs | 1985727 | 564845 | 389947 | 351269 |
**connectionCycle, 4.5.0-snapshot-proposed**

| metric \ workload threads, max pool size | 1000, 1000 | 6, 6 | 1, 1 | 1000, 250 |
|---|---|---|---|---|
| throughput, ops/ms | 656.229 ± 181.564 | 2066.979 ± 52.481 | 3207.597 ± 183.752 | 179.161 ± 60.814 |
| avg latency, ms | 0.967 ± 0.002 | - | - | 4.454 ± 0.010 |
| p0 (min) latency, ms | ≈ 10⁻⁴ | - | - | ≈ 10⁻⁴ |
| p0.5 latency, ms | ≈ 10⁻³ | - | - | ≈ 10⁻³ |
| p0.9 latency, ms | 0.001 | - | - | 28.574 |
| p0.95 latency, ms | 0.001 | - | - | 31.457 |
| p0.99 latency, ms | 31.556 | - | - | 69.075 |
| p0.999 latency, ms | 45.285 | - | - | 77.070 |
| p0.9999 latency, ms | 65.864 | - | - | 93.061 |
| p1 (max) latency, ms | 120.717 | - | - | 126.616 |
Let's compare this with the baseline and 4.5.0-snapshot:
| benchmark; metric \ version | 4.2.0 (baseline) | 4.5.0-snapshot | 4.5.0-snapshot-proposed | comment |
|---|---|---|---|---|
| ycsb_100read, (1000 threads, 1000 connections); throughput, ops/s | 45k | - | 48k | 👍 Obviously, the throughput here is not better, it is just not worse, which is good. |
| connectionCycle, (1000 threads, 1000 connections); throughput, ops/s | 2688k | 33k | 656k | 👎👍 Not good, but apparently enough to achieve the baseline throughput in an integration benchmark such as ycsb_100read, which is why I consider this acceptable. It is also a significant improvement compared to 4.5.0-snapshot. |
| connectionCycle, (1000 threads, 1000 connections); p0.9999 latency, ms | 886 | - | 66 | 👍 |
| connectionCycle, (1000 threads, 1000 connections); p1 latency, ms | 2072 | - | 121 | 👍 |
| connectionCycle, (1000 threads, 250 connections); throughput, ops/s | 34k | - | 179k | 👍 |
| connectionCycle, (1000 threads, 250 connections); p0.9999 latency, ms | 43 | - | 93 | 👎 |
| connectionCycle, (1000 threads, 250 connections); p1 latency, ms | 43 | - | 127 | 👎 |
I don't think that the proposed change is great, but it appears to be good enough to resolve the regression. Maybe with a higher number of workload threads in the ycsb_100read benchmark, or in other integration benchmarks, the regression still manifests itself (though definitely to a lesser extent). Unfortunately, when I tried running ycsb_100read with 5000 threads, the benchmark runner failed to stop the benchmark after maxexecutiontime, and I did not try to play around with the configuration or figure out the reason; I simply stopped at 1000 threads.
P.S. It's a long report, but I wanted to show the reasoning in detail, as well as provide supporting details, because that gives other engineers an opportunity to find problems in the reasoning, question it, and either prove me wrong or maybe even propose a better approach.
¹ Clearly, a semaphore itself is a blocking synchronization mechanism, but we are interested specifically in the progress guarantees of acquiring a permit when the maximum number of permits that all threads using the semaphore may ever want to hold concurrently is not larger than the number of permits the semaphore maintains.
² It may seem viable to use permits > 0 as a way to decide that there are likely no waiters, but benchmarking shows that with this approach the high percentiles of contended connectionCycle latencies (when the number of threads that use the pool is higher than the maximum pool size) become similar to the situation when no fair locking is used. That is, this approach does not result in the behavior we want.