[JAVA-4452] Improve performance characteristics of the connection pool to be on par with the driver 4.2.0 Created: 25/Jan/22  Updated: 28/Oct/23  Resolved: 27/Jan/22

Status: Closed
Project: Java Driver
Component/s: Performance
Affects Version/s: None
Fix Version/s: 4.5.0

Type: Improvement Priority: Critical - P2
Reporter: Valentin Kavalenka Assignee: Valentin Kavalenka
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on JAVA-4451 Fix the await condition in OpenConcur... Closed
Problem/Incident
causes JAVA-4462 Introduce a benchmark that could have... Backlog
is caused by JAVA-3927 Rate limit new connection creations (... Closed
is caused by JAVA-3928 Connection pool paused state Closed
Related
related to JAVA-4454 Introduce checking benchmark results ... Closed
Backwards Compatibility: Fully Compatible
Documentation Changes: Not Needed

 Description   

We observed a significant regression in the ycsb_100read benchmark (see ycsb-mongodb/workloads/workloadc and dsi/configurations/test_control/test_control.ycsb.yml). To give some perspective: when ycsb_100read is run against a single-node MongoDB replica set, the throughput dropped from 55k ops/s for 4.2.0, to 36k ops/s for 4.3.0, to 19k ops/s for 4.4.0.



 Comments   
Comment by Githook User [ 27/Jan/22 ]

Author:

{'name': 'Valentin Kovalenko', 'email': 'valentin.male.kovalenko@gmail.com', 'username': 'stIncMale'}

Message: Alternate fair/unfair lock modes in the connection pool to improve throughput (#862)

Another tiny change is moving `pool.release(openConnection)` out of the guarded section
in `DefaultConnectionPool.OpenConcurrencyLimiter.tryHandOverOrRelease`:
it does not have to be guarded, and when unfair locking is also in use,
this change has a significant, measurable positive effect
on the connection cycle (check out / check in) throughput.

JAVA-4452
Branch: master
https://github.com/mongodb/mongo-java-driver/commit/146e7ce7e9b4839ba0959f45116d298c713c5abb
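
To make the commit message above concrete, here is a simplified, hypothetical illustration of moving the release out of the guarded section; the class, fields, and method body below are made up for illustration and are not the actual DefaultConnectionPool code:

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical sketch: the only point illustrated is that returning the connection to the
    // underlying pool is thread-safe on its own and does not need to happen while the limiter's
    // lock is held.
    final class HandOverOrReleaseSketch {
        static final class Connection {}

        private final ReentrantLock limiterLock = new ReentrantLock(true); // the limiter uses a fair lock
        private final ConcurrentLinkedQueue<Connection> pool = new ConcurrentLinkedQueue<>();

        void tryHandOverOrRelease(Connection openConnection) {
            boolean handedOver;
            limiterLock.lock();
            try {
                // In the real code a waiting thread may receive the connection here;
                // in this sketch there is never anyone to hand it over to.
                handedOver = false;
            } finally {
                limiterLock.unlock();
            }
            if (!handedOver) {
                pool.offer(openConnection); // performed outside the guarded section
            }
        }
    }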

Comment by Valentin Kavalenka [ 26/Jan/22 ]

The cause

The degradation is caused by fair locking, introduced first by JAVA-3927 in 4.3.0 (DefaultConnectionPool.OpenConcurrencyLimiter.lock) and then additionally by JAVA-3928 in 4.4.0 (ConcurrentPool.StateAndPermits.lock). This was a hunch, which I am calling "an educated guess" to sound like other grown-ups. To test this hypothesis, I ran ycsb_100read via the DSI against each of the following: 4.2.0, a snapshot of 4.5.0, and a modified 4.5.0 that uses unfair locks in the connection pool. The results showed that with unfair locking 4.5.0-snapshot is on par with 4.2.0 (values represent the throughput in ops/s against a standalone MongoDB installation, which is the only topology I ran ycsb_100read against):

workload threads \ driver version | 4.2.0 | 4.5.0-snapshot | 4.5.0-snapshot-unfair
1000                              | 44648 | 15595          | 42399
100                               | 56489 | 18258          | 57612

I compared only the snapshot of 4.5.0 (specifically, c52ff80) with 4.2.0, disregarding 4.3.0 and 4.4.0 because the same regression is observable with the snapshot of 4.5.0.

The baseline

The cause is clear; now we need to gather specific characteristics of 4.2.0 to treat them as the baseline when judging the fix. Having the ycsb_100read benchmark is good, but I also wanted a narrower benchmark that measures the performance characteristics of a connection cycle (check out / check in). For this purpose I introduced the connectionCycle benchmark, written with JMH.
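
For the curious, below is a minimal sketch of what a check out / check in JMH benchmark of this kind might look like; it uses a plain fair semaphore as a stand-in for the connection pool, and the class name, thread count, and pool size are illustrative, not the actual connectionCycle benchmark:

    import java.util.concurrent.Semaphore;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    // Sketch only: measures the throughput of checking a "connection" out and back in,
    // with the pool replaced by a fair semaphore that enforces the maximum pool size.
    @State(Scope.Benchmark)
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    @Threads(1000) // the number of workload threads
    public class ConnectionCycleSketch {
        private Semaphore permits;

        @Setup
        public void setup() {
            permits = new Semaphore(250, true); // the maximum pool size; fair, as in 4.2.0
        }

        @Benchmark
        public void cycle() throws InterruptedException {
            permits.acquire();  // check a connection out
            permits.release();  // check it back in
        }
    }

The real benchmark also collects the latency distribution (the percentiles below), which JMH can produce with Mode.SampleTime.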

When viewing the results, one should keep in mind the following:

  • ycsb_100read always has enough connections in the pool to serve all workload threads, while connectionCycle is sometimes run with fewer connections than workload threads.
  • ycsb_100read was run on a computer with 10 cores / 20 hardware threads, while connectionCycle was run on a computer with 6 cores / 12 hardware threads, so the results are not directly comparable across the benchmarks.
  • p0, p0.5, etc. are percentiles.
  • For some reason, the runner for ycsb_100read does not calculate p0.95 and p0.99 latencies for 1000 workload threads.
  • Throughput units: ops/s for ycsb_100read, ops/ms for connectionCycle.
  • Latency units: μs for ycsb_100read, ms for connectionCycle.
  • The "±" part represents 99.9% confidence interval, assuming that the distribution of all measurements is Gaussian.
  • All latency distributions were measured with the system under test severely loaded to achieve maximum throughput. These latencies represent the worst case and were used only to assess the fairness characteristics.
ycsb_100read, 4.2.0
metric \ workload threads | 1000    | 100    | 6      | 1
throughput, ops/s         | 44648   | 56489  | 17172  | 3066
avg latency, μs           | 21228   | 1748   | 344    | 321
p0 (min) latency, μs      | 266     | 260    | 270    | 295
p0.95 latency, μs         | -       | 2140   | 380    | 340
p0.99 latency, μs         | -       | 3510   | 410    | 400
p1 (max) latency, μs      | 6755862 | 958470 | 363509 | 328838

connectionCycle, 4.2.0
metric \ workload threads, max pool size | 1000, 1000        | 6, 6              | 1, 1              | 1000, 250
throughput, ops/ms                       | 2687.579 ± 91.716 | 2730.714 ± 31.349 | 5081.488 ± 26.004 | 33.639 ± 1.238
avg latency, ms                          | 0.354 ± 0.003     | -                 | -                 | 29.480 ± 0.002
p0 (min) latency, ms                     | ≈ 10⁻⁴            | -                 | -                 | 27.984
p0.5 latency, ms                         | 0.004             | -                 | -                 | 29.295
p0.9 latency, ms                         | 0.008             | -                 | -                 | 29.983
p0.95 latency, ms                        | 0.009             | -                 | -                 | 30.671
p0.99 latency, ms                        | 0.014             | -                 | -                 | 34.603
p0.999 latency, ms                       | 0.025             | -                 | -                 | 41.091
p0.9999 latency, ms                      | 886.047           | -                 | -                 | 42.598
p1 (max) latency, ms                     | 2071.986          | -                 | -                 | 42.795

Some of the baseline metric values deserve explanations.

The baseline connectionCycle throughput for (1000 threads, 250 connections) is interesting

34 ops/ms is 34_000 ops/s, i.e., merely checking a connection out and checking it back in has lower throughput than ycsb_100read shows for 1000 threads, 1000 connections, which is 45_000 ops/s. As I mentioned, we can't directly compare the results, because they were obtained on different computers, but we can definitely tell that the throughput measured in connectionCycle for this scenario is very low. I am sure ycsb_100read would show this for 4.2.0 as well if it were run with fewer connections than workload threads.

The reason is that 4.2.0 uses a fair semaphore to enforce the pool's maxSize. When there are enough available permits for all threads, threads do not queue waiting for an available permit, and the fairness of the semaphore plays no role. However, when there are not enough available permits, which is definitely the case for (1000 threads, 250 connections), threads not only queue waiting for a permit, they all have to go to the end of the queue and wait their turn because the semaphore is fair. Due to this behavior, the throughput in connectionCycle drops by an order of magnitude. The latencies, however, are all very similar, ranging from 28 to 43 ms.
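
For illustration (the values here are made up): the difference between the fair semaphore used by 4.2.0 and the unfair semaphore used in the 4.2.0-unfair experiment further below presumably comes down to a single constructor flag:

    import java.util.concurrent.Semaphore;

    // Sketch: maxSize enforcement with a fair vs an unfair semaphore.
    final class MaxSizePermitsSketch {
        // Fair: once all permits are taken, arriving threads queue and acquire in FIFO order.
        static final Semaphore FAIR_PERMITS = new Semaphore(250, true);
        // Unfair: an arriving thread may "barge" past queued threads if a permit happens to be free.
        static final Semaphore UNFAIR_PERMITS = new Semaphore(250, false);

        static void connectionCycle(Semaphore permits) throws InterruptedException {
            permits.acquire(); // check out: blocks only when all maxSize permits are in use
            try {
                // use the connection
            } finally {
                permits.release(); // check in
            }
        }
    }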

The baseline connectionCycle high percentiles of latencies for (1000 threads, 1000 connections) are interesting

p0.9999 is almost 1 s, p1 is about 2 s. Remember, the benchmark just checks connections out and back in, doing nothing else, and yet a cycle sometimes takes 2 seconds. This is interesting and surprising at first, because the semaphore used is fair. However, as explained above, threads are not blocked by this semaphore, because there are enough available permits for them, which means that the semaphore's fairness does not play a role. Instead, given that 1000 threads is far more than the number of threads the CPU can run in parallel, what we likely observe here is the unfairness of the OS thread scheduler. Thanks, jeff.yemin, for the idea of this explanation (my initial explanation was proven to be incorrect).

Going deeper into the cause

We saw that making blocking behavior unfair in 4.5.0 solves the regression, but why is that the case, given that the connection pool in 4.2.0 also blocks threads in a fair manner by using a fair semaphore?

Implementing each of the aforementioned tasks (JAVA-3927, JAVA-3928) requires signalling to blocked threads when the condition they are waiting for occurs, and doing so in a way that makes losing the signal impossible. Among the concurrency primitives known to me, a condition variable (java.util.concurrent.locks.Condition) is the only way to do this. As a result, I had to replace the semaphore with a home-grown lock-based semaphore (ConcurrentPool.StateAndPermits), because this way I have access to the Condition API. A significant difference between java.util.concurrent.Semaphore and ConcurrentPool.StateAndPermits from the performance perspective is that they have different progress guarantees when there are enough available permits: ConcurrentPool.StateAndPermits.acquirePermit/releasePermit is blocking, while Semaphore.acquire/release is lock-free [1], though not wait-free: compare this code in ConcurrentPool.StateAndPermits.acquirePermit with the code in Semaphore that represents the happy path of Semaphore.acquire.
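
To make the difference tangible, here are two simplified sketches of permit acquisition; they illustrate the two progress guarantees being contrasted and are not the actual ConcurrentPool.StateAndPermits or java.util.concurrent.Semaphore code:

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    final class PermitsSketch {
        // Lock-based permits in the spirit of ConcurrentPool.StateAndPermits: every acquire/release
        // takes the lock, even when permits are plentiful, because signalling waiters requires a Condition.
        static final class LockBasedPermits {
            private final ReentrantLock lock = new ReentrantLock(true); // fair, as in 4.4.0
            private final Condition permitAvailable = lock.newCondition();
            private int permits;

            LockBasedPermits(int permits) {
                this.permits = permits;
            }

            void acquirePermit() throws InterruptedException {
                lock.lock(); // blocking even on the happy path
                try {
                    while (permits == 0) {
                        permitAvailable.await(); // a lost signal is impossible: await releases the lock atomically
                    }
                    permits--;
                } finally {
                    lock.unlock();
                }
            }

            void releasePermit() {
                lock.lock();
                try {
                    permits++;
                    permitAvailable.signal();
                } finally {
                    lock.unlock();
                }
            }
        }

        // By contrast, the happy path of Semaphore.acquire is a CAS loop (lock-free, not wait-free):
        // no thread blocks as long as a permit is available.
        static final class CasBasedPermits {
            private final AtomicInteger permits;

            CasBasedPermits(int permits) {
                this.permits = new AtomicInteger(permits);
            }

            boolean tryAcquirePermitFastPath() {
                int available;
                do {
                    available = permits.get();
                    if (available == 0) {
                        return false; // a real Semaphore would enqueue and park the thread here
                    }
                } while (!permits.compareAndSet(available, available - 1));
                return true;
            }

            void releasePermit() {
                permits.incrementAndGet(); // a real Semaphore would also unpark a waiter if needed
            }
        }
    }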

It is now clear that in ycsb_100read (1000 threads, 1000 connections) workload threads are not blocked in 4.2.0, but are blocked in 4.5.0 (even though no thread waits for an available permit because there are enough of them), and blocked in a fair manner, which further reduces the throughput measured by the benchmark.

A solution: get rid of fairness

Wishful thinking led me to the following hypothesis: maybe we don't actually need fairness? I modified 4.2.0 to use an unfair semaphore and took a look at the latency distribution in the connectionCycle benchmark. The results are:

connectionCycle, 4.2.0-unfair
metric \ workload threads, max pool size | 1000, 1000         | 6, 6              | 1, 1               | 1000, 250
throughput, ops/ms                       | 2673.850 ± 166.890 | 2729.713 ± 60.227 | 5130.801 ± 116.675 | 2754.849 ± 172.705
avg latency, ms                          | 0.358 ± 0.004      | -                 | -                  | 0.383 ± 0.003
p0 (min) latency, ms                     | ≈ 10⁻⁴             | -                 | -                  | ≈ 10⁻⁵
p0.5 latency, ms                         | 0.004              | -                 | -                  | 0.004
p0.9 latency, ms                         | 0.008              | -                 | -                  | 0.007
p0.95 latency, ms                        | 0.009              | -                 | -                  | 0.009
p0.99 latency, ms                        | 0.014              | -                 | -                  | 0.014
p0.999 latency, ms                       | 0.027              | -                 | -                  | 73.138
p0.9999 latency, ms                      | 917.504            | -                 | -                  | 590.348
p1 (max) latency, ms                     | 2048.918           | -                 | -                  | 2122.318

We can see that unfair locking, which comes into play for (1000 threads, 250 connections), causes quite significant tail latencies. This seems to be evidence in support of the connection pool specification's fairness requirement.

The proposed solution: alternate between fair and unfair modes

What if we had fair locking behavior only when there are not enough available permits, and unfair locking behavior when there are enough of them? As far as fairness is concerned, this is exactly the behavior we had in 4.2.0 with the fair semaphore.

Fairness of the Java SE API locks is not reconfigurable. However, ReentrantReadWriteLock.WriteLock.tryLock disregards fairness and allows a thread to acquire the lock while ignoring all the queued threads, provided that the lock is not held by another thread. Calling tryLock, and then lock if the former does not succeed, emulates an unfair lock to an extent.

All we need now is to know whether there are waiting threads, in order to decide whether fair locking is needed. ReentrantReadWriteLock.hasWaiters requires the calling thread to hold the lock, but we need to know whether there are waiters before acquiring the lock. Well, we can estimate this by counting waiters [2] on our own.
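
Putting the two previous paragraphs together, a simplified sketch of the alternating fair/unfair acquisition could look like the following; the names and the exact bookkeeping are illustrative, not the exact driver implementation:

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Sketch: acquire unfairly (barge via tryLock) when no waiters are known,
    // otherwise register as a waiter and acquire fairly via lock().
    final class FairUnfairLockSketch {
        private final ReentrantReadWriteLock.WriteLock lock = new ReentrantReadWriteLock(true).writeLock();
        // Our own estimate of how many threads are currently waiting to acquire the lock,
        // maintained because hasWaiters requires the lock to already be held.
        private final AtomicInteger waitersEstimate = new AtomicInteger();

        void lockFairOrUnfair() {
            if (waitersEstimate.get() == 0 && lock.tryLock()) {
                return; // no (known) waiters: acquired unfairly, skipping the queue
            }
            waitersEstimate.incrementAndGet();
            try {
                lock.lock(); // there are waiters (or barging failed): acquire fairly
            } finally {
                waitersEstimate.decrementAndGet();
            }
        }

        void unlock() {
            lock.unlock();
        }
    }

Because the counter is only an estimate, a thread may occasionally barge while another thread is about to start waiting; the point is merely to keep the fast path unfair when the lock is uncontended and to fall back to fair queueing under contention.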

Here is what we get as a result of applying this approach:

ycsb_100read, 4.5.0-snapshot-proposed
metric \ workload threads | 1000    | 100    | 6      | 1
throughput, ops/s         | 47704   | 57610  | 16358  | 2887
avg latency, μs           | 20805   | 1709   | 362    | 341
p0 (min) latency, μs      | 257     | 276    | 289    | 304
p0.95 latency, μs         | -       | 1960   | 400    | 380
p0.99 latency, μs         | -       | 3000   | 440    | 480
p1 (max) latency, μs      | 1985727 | 564845 | 389947 | 351269

connectionCycle, 4.5.0-snapshot-proposed
metric \ workload threads, max pool size | 1000, 1000        | 6, 6              | 1, 1               | 1000, 250
throughput, ops/ms                       | 656.229 ± 181.564 | 2066.979 ± 52.481 | 3207.597 ± 183.752 | 179.161 ± 60.814
avg latency, ms                          | 0.967 ± 0.002     | -                 | -                  | 4.454 ± 0.010
p0 (min) latency, ms                     | ≈ 10⁻⁴            | -                 | -                  | ≈ 10⁻⁴
p0.5 latency, ms                         | ≈ 10⁻³            | -                 | -                  | ≈ 10⁻³
p0.9 latency, ms                         | 0.001             | -                 | -                  | 28.574
p0.95 latency, ms                        | 0.001             | -                 | -                  | 31.457
p0.99 latency, ms                        | 31.556            | -                 | -                  | 69.075
p0.999 latency, ms                       | 45.285            | -                 | -                  | 77.070
p0.9999 latency, ms                      | 65.864            | -                 | -                  | 93.061
p1 (max) latency, ms                     | 120.717           | -                 | -                  | 126.616

Let's compare this with the baseline and 4.5.0-snapshot:

benchmark; metric \ version                                            | 4.2.0 (baseline) | 4.5.0-snapshot | 4.5.0-snapshot-proposed | comment
ycsb_100read, (1000 threads, 1000 connections); throughput, ops/s      | 45k              | -              | 48k 👍                  | Obviously, the throughput here is not better, it is just not worse, which is good.
connectionCycle, (1000 threads, 1000 connections); throughput, ops/s   | 2688k            | 33k            | 656k 👎👍               | Not good, but apparently enough to achieve the baseline throughput in an integration benchmark such as ycsb_100read, which is why I consider this acceptable. It is also a significant improvement compared to 4.5.0-snapshot.
connectionCycle, (1000 threads, 1000 connections); p0.9999 latency, ms | 886              | -              | 66 👍                   |
connectionCycle, (1000 threads, 1000 connections); p1 latency, ms      | 2072             | -              | 121 👍                  |
connectionCycle, (1000 threads, 250 connections); throughput, ops/s    | 34k              | -              | 179k 👍                 |
connectionCycle, (1000 threads, 250 connections); p0.9999 latency, ms  | 43               | -              | 93 👎                   |
connectionCycle, (1000 threads, 250 connections); p1 latency, ms       | 43               | -              | 127 👎                  |

I don't think that the proposed change is great, but it appears to be good enough to resolve the regression. Maybe with a higher number of workload threads in the ycsb_100read benchmark, or in other integration benchmarks, the regression still manifests itself (though definitely to a lesser extent). Unfortunately, when I tried running ycsb_100read with 5000 threads, the benchmark runner failed to stop the benchmark after maxexecutiontime, and I did not try to play around with the configuration or figure out the reason; I simply stopped at 1000 threads.

P.S. It's a long report, but I wanted to show the reasoning in detail and provide the surrounding details, because that gives other engineers an opportunity to find problems in the reasoning, question it, and either prove me wrong or maybe even propose a better approach.


[1] Clearly, a semaphore is itself a blocking synchronization mechanism, but here we are interested specifically in the progress guarantees of acquiring a permit when the maximum number of permits that all threads using the semaphore may ever want to hold concurrently is not larger than the number of permits the semaphore maintains.
[2] It may seem viable to use permits > 0 as a way to decide that there are likely no waiters, but benchmarking shows that with this approach the high percentiles of contended connectionCycle latencies (when the number of threads using the pool is higher than the maximum pool size) become similar to the situation when no fair locking is used at all. That is, this approach does not result in the behavior we want.
