[JAVA-4146] Fix timeout handling in `DefaultConnectionPool.getAsync` Created: 07/May/21  Updated: 28/Oct/23  Resolved: 24/Jul/21

Status: Closed
Project: Java Driver
Component/s: Connection Management
Affects Version/s: None
Fix Version/s: 4.3.0

Type: Task Priority: Unknown
Reporter: Valentin Kavalenka Assignee: Valentin Kavalenka
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Problem/Incident
is caused by JAVA-3927 Rate limit new connection creations (... Closed
Backwards Compatibility: Fully Compatible
Documentation Changes: Not Needed

 Description   

Before implementing maxConnecting

This section describes how asynchronous checkouts were implemented in the codebase from Nov 25, 2014 (3b544aad086ea9b11039e1d188fbfa5ce12f7795) until Apr 14, 2021 (ead0357131c1c0baa364656b07c91f2c789918e3).

DefaultConnectionPool.getAsync uses a single dedicated thread to handle all asynchronous checkout requests. This naturally queues all such requests, and the thread does not detect that a request timed out until it starts serving it. As a result, if the request that is currently being served is blocked because maxSize is reached, the effective timeout of any queued asynchronous request becomes at least as large as the duration for which the currently served request is blocked. In other words, the pool can significantly overrun the timeouts of asynchronous requests.
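A minimal sketch of this behavior, using illustrative names rather than the driver's actual internals: a single worker thread drains a queue of checkout requests and notices a request's deadline only when it starts serving that request, so a blocked head request delays timeout detection for everything behind it.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SingleWorkerPool {
    static final class Request {
        final long deadlineNanos;
        final CompletableFuture<String> result = new CompletableFuture<>();
        Request(long timeoutMillis) {
            deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        }
    }

    final BlockingQueue<Request> queue = new LinkedBlockingQueue<>();

    // headBlockMillis simulates the head request being blocked on maxSize
    void startWorker(long headBlockMillis) {
        Thread worker = new Thread(() -> {
            boolean first = true;
            try {
                while (true) {
                    Request r = queue.take();
                    // A timeout is noticed only here, after dequeuing:
                    if (System.nanoTime() > r.deadlineNanos) {
                        r.result.completeExceptionally(new TimeoutException("expired while queued"));
                        continue;
                    }
                    if (first) {
                        first = false;
                        Thread.sleep(headBlockMillis); // head request is blocked
                    }
                    r.result.complete("connection");
                }
            } catch (InterruptedException e) {
                // worker stops
            }
        });
        worker.setDaemon(true);
        worker.start();
    }
}
```

With a head request blocked for 300 ms, a queued request with a 100 ms timeout is failed only after roughly 300 ms, i.e., its effective timeout is dictated by the head request.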

However, if we constrain all asynchronous checkout timeouts to be equal (which is the case for now, but will no longer be true once CSOT is implemented), then a request closer to the tail of the queue cannot time out before the request that is currently being served (the head request). When the head request times out, the dedicated thread is unblocked and proceeds to serve the next request in line. Thus, the described approach, while incorrect in situations where request timeouts may differ, works correctly when timeouts are equal.

If timeouts could differ, we could maintain the request queue explicitly and choose each blocking time to be the minimum of the remaining times of all queued requests, including the one being served: block the thread for that amount of time, then expire and remove from the queue all requests that have timed out (this approach is used in com.mongodb.internal.connection.LoadBalancedCluster). When all timeouts are equal, the minimal remaining time is that of the request being served, so no additional machinery is needed to handle the timeouts of the queued requests correctly.
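The explicit-queue bookkeeping described above can be sketched as follows (names are illustrative; LoadBalancedCluster uses this idea, but not this exact code): compute the minimum remaining time over all queued requests to bound each blocking interval, and after waking, fail and remove every expired request.

```java
import java.util.Collection;
import java.util.Deque;
import java.util.Iterator;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ExplicitQueueTimeouts {
    static final class Request {
        final long deadlineNanos;
        final CompletableFuture<String> result = new CompletableFuture<>();
        Request(long timeoutMillis) {
            deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        }
        long remainingNanos() {
            return deadlineNanos - System.nanoTime();
        }
    }

    // How long the worker may block: the minimum remaining time over all
    // queued requests, including the one currently being served.
    static long minRemainingNanos(Collection<Request> queued) {
        long min = Long.MAX_VALUE;
        for (Request r : queued) {
            min = Math.min(min, r.remainingNanos());
        }
        return min;
    }

    // Remove every timed-out request from the queue and fail it.
    static void expireTimedOut(Deque<Request> queued) {
        for (Iterator<Request> it = queued.iterator(); it.hasNext(); ) {
            Request r = it.next();
            if (r.remainingNanos() <= 0) {
                it.remove();
                r.result.completeExceptionally(new TimeoutException("checkout timed out"));
            }
        }
    }
}
```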

After implementing maxConnecting

Since Apr 14, 2021 (ead0357131c1c0baa364656b07c91f2c789918e3) the situation has changed. Now async checkout requests may also be blocked because maxConnecting is reached, and such blocking is done in another dedicated thread. Thus, we now have two queues of async checkout requests, one per dedicated thread, and it is no longer true that, among the requests in the second queue, the one with the smallest remaining time before timing out is the one currently being served (the head request). In other words, even without CSOT, the current implementation may significantly overrun the timeouts of async checkout requests.

Approaches to solve the problem

One approach would be to handle timeouts of queued async requests as described above, i.e., by changing the logic of each dedicated thread. Another approach is to have one more dedicated thread that monitors which queued requests have timed out, removes them from the queues, and completes them with a timeout exception. Both approaches require maintaining the request queues explicitly. The first approach results in smaller resource consumption (no need for one more dedicated thread), while the second approach does not require many changes to the existing logic of the two dedicated threads (because it uses a third thread whose new timeout logic is separated from the existing logic).
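A minimal sketch of the second approach (all names hypothetical, not the driver's API): a scheduled monitor thread periodically scans the request queues, removes expired requests, and completes them exceptionally, leaving the worker threads' logic untouched.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CheckoutTimeoutMonitor {
    static final class Request {
        final long deadlineNanos;
        final CompletableFuture<String> result = new CompletableFuture<>();
        Request(long timeoutMillis) {
            deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        }
    }

    // Scan every queue on a fixed delay; expired requests are removed and
    // failed here, so the worker threads never see them.
    static ScheduledFuture<?> startMonitor(ScheduledExecutorService executor,
                                           List<Queue<Request>> queues) {
        return executor.scheduleWithFixedDelay(() -> {
            for (Queue<Request> q : queues) {
                for (Iterator<Request> it = q.iterator(); it.hasNext(); ) {
                    Request r = it.next();
                    if (System.nanoTime() > r.deadlineNanos) {
                        it.remove(); // safe for ConcurrentLinkedQueue iterators
                        r.result.completeExceptionally(new TimeoutException("checkout timed out"));
                    }
                }
            }
        }, 10, 10, TimeUnit.MILLISECONDS);
    }
}
```

The queues must support concurrent iteration (e.g., ConcurrentLinkedQueue), since the monitor iterates while the worker threads add and remove requests.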

My plan is to try the second approach because it is simpler. It appears that the new (third) dedicated thread may be shared between all pools of the same MongoClient, thus significantly reducing the overhead. However, if we decide to implement such sharing, it should be done as a separate task.



 Comments   
Comment by Githook User [ 08/Jun/21 ]

Author:

{'name': 'Valentin Kovalenko', 'email': 'valentin.kovalenko@mongodb.com', 'username': 'stIncMale'}

Message: Refactor `DefaultConnectionPool` to use single worker thread when checking out asynchronously (#722)

When implementing https://github.com/mongodb/mongo-java-driver/commit/73e30857eed2f1d02ff7e9a6e5028d557a1cf4cf
I correctly explained the deadlock situation:
1) An async task that created a non-open connection and tries to open it (let us call it an "open task")
may block another async task that tries to create a connection (let us call it a "get task")
if `maxSize` is reached.
2) If both tasks are submitted to the same queue such that the get task is ahead of the open task, then
the get task naturally blocks the open task while itself being blocked by the open task.
However, I failed to fix it optimally.
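The ordering deadlock can be reproduced in miniature with a single-threaded executor (an illustrative sketch, not the driver's code): the get task waits for a "maxSize" permit that only the open task can free, but the open task is queued behind it on the same worker.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class OrderingDeadlockSketch {
    public static boolean checkoutSucceeds() throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor(); // one queue, one worker
        Semaphore maxSizePermits = new Semaphore(0); // pool is already at maxSize
        // get task: blocks waiting for a connection slot to be released
        Future<Boolean> getTask = worker.submit(
                () -> maxSizePermits.tryAcquire(200, TimeUnit.MILLISECONDS));
        // open task: would release a permit, but is queued behind the get task
        worker.submit(maxSizePermits::release);
        boolean acquired = getTask.get(); // false: the get task times out first
        worker.shutdown();
        return acquired;
    }
}
```

Without the tryAcquire timeout this would hang forever; with it, the get task simply times out before the open task ever runs, which is the non-optimal behavior being fixed.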

The solution to the problem I used is to submit tasks to different queues, which,
because both tasks are blocking, required introducing one more async worker thread.

However, there is another approach to solving the deadlock problem,
which did not occur to me at that time: prevent the get task from being ordered
ahead of the open task in the queue. This may happen due to the optimization in the `getAsync` method, which
tries to check out synchronously if an open connection is available,
i.e., if checking it out requires no blocking. It is a nice optimization that I
tried to preserve (it was there before my changes) when implementing JAVA-3927.
However, this optimization opens up opportunities for a get task being blocked
by an open task to be ordered ahead of the open task in the queue.

Had I realized that earlier, I would have sacrificed this optimization
to eliminate the deadlock instead of introducing another queue and a thread.
Fortunately, it occurred to me now.

This change also fixes reactive `MainTransactionsTest` runner
so that the `abort.json`: "implicit abort" test succeeds.
This test expects a command_started_event for the abortTransaction command, which is sent
as a result of closing the session used by the test. The close method is asynchronous and
the test must wait for its completion.

JAVA-4146
Branch: master
https://github.com/mongodb/mongo-java-driver/commit/6ab9ce6462f5cc1a994e9a829a1a7c27ecb2c546

Comment by Jeffrey Yemin [ 10/May/21 ]

This is not really a documentation ticket anymore.  There needs to be a functional change.

Generated at Thu Feb 08 09:01:21 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.