[JAVA-4034] Refactor read/write retries to accommodate for both PoolClearedError and CSOT Created: 05/Mar/21  Updated: 28/Oct/23  Resolved: 27/Sep/21

Status: Closed
Project: Java Driver
Component/s: Internal
Affects Version/s: None
Fix Version/s: 4.4.0

Type: Improvement
Priority: Major - P3
Reporter: Backlog - Core Eng Program Management Team
Assignee: Valentin Kavalenka
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Gantt Dependency
has to be done before JAVA-4028 Provide explicit guidance on handling... Closed
Issue split
Epic Link: Avoiding connection storms
Quarter: FY22Q4
Backwards Compatibility: Fully Compatible
Documentation Changes: Not Needed

 Description   

DRIVERS Ticket Description


 Comments   
Comment by Githook User [ 25/Jul/22 ]

Author:

{'name': 'Jeff Yemin', 'email': 'jeff.yemin@mongodb.com', 'username': 'jyemin'}

Message: Do not retry a read operation when in a transaction (#982)

Fixes a regression introduced in 4.4.0 in scope of JAVA-4034, and
was undetected due to the lack of a specification test for the
required behavior.

JAVA-4684
Branch: 4.7.x
https://github.com/mongodb/mongo-java-driver/commit/f42ef7d028dc4b5b2bdda3e32bfe1f8991c2c21b

Comment by Githook User [ 25/Jul/22 ]

Author:

{'name': 'Jeff Yemin', 'email': 'jeff.yemin@mongodb.com', 'username': 'jyemin'}

Message: Do not retry a read operation when in a transaction (#982)

Fixes a regression introduced in 4.4.0 in scope of JAVA-4034, and
was undetected due to the lack of a specification test for the
required behavior.

JAVA-4684
Branch: master
https://github.com/mongodb/mongo-java-driver/commit/afb5d0ae6693d1915b7afa3b84d250974fe15998

Comment by Githook User [ 27/Sep/21 ]

Author:

{'name': 'Valentin Kovalenko', 'email': 'valentin.kovalenko@mongodb.com', 'username': 'stIncMale'}

Message: JAVA-4034 Refactor read/write retries to accommodate for both `PoolClearedError` and CSOT (#782)

Changes in this commit do the following two major things:

1) These changes make selecting a server (`binding.get*ConnectionSource`)
and checking out a connection (`source.getConnection`) part of a retryable operation.
Such a change makes `MongoConnectionPoolClearedException`s part of the operation,
thus allowing us to retry the operation when the exception happens.

2) The retry limit (no more than two attempts) was hardcoded in the driver's code structure:
the code described the first attempt and then, depending on the outcome, executed separate code
for the second attempt, which mostly duplicated the code for the first attempt but handled errors
differently. Such an approach not only hurts code readability, but also prevents changing the number
of attempts, let alone deciding whether to retry based on something other than an attempt limit.
The pseudocode in the specifications is also written this way, and at least some other drivers,
e.g., C#, Rust, took a similar approach to structure the code.
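
The two points above can be illustrated with a minimal sketch. It is not the driver's code: `RetryLoopSketch`, `PoolClearedException`, `executeWithRetries` and `attempt` are hypothetical names, and the exception is only a stand-in for `MongoConnectionPoolClearedException`. The sketch shows a single loop whose attempt covers server selection and connection checkout, so a failure at checkout is retried by the same code path as any other attempt.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class RetryLoopSketch {
    /** Hypothetical stand-in for the driver's MongoConnectionPoolClearedException. */
    static final class PoolClearedException extends RuntimeException {
        PoolClearedException(String message) {
            super(message);
        }
    }

    /** Runs the attempt at most maxAttempts times, retrying only on PoolClearedException. */
    static <T> T executeWithRetries(Supplier<T> attempt, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be positive");
        }
        PoolClearedException lastFailure = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return attempt.get();
            } catch (PoolClearedException e) {
                lastFailure = e; // remember the failure and run the whole attempt again
            }
        }
        throw lastFailure;
    }

    /** One attempt: server selection, connection checkout, and the command itself. */
    static String attempt(AtomicInteger checkouts) {
        String server = "server-A"; // stands in for binding.get*ConnectionSource
        int checkout = checkouts.incrementAndGet(); // stands in for source.getConnection
        if (checkout == 1) {
            // Simulate the pool being cleared during the first checkout.
            throw new PoolClearedException("pool for " + server + " was cleared");
        }
        return "ok via " + server + "/conn-" + checkout;
    }

    public static void main(String[] args) {
        AtomicInteger checkouts = new AtomicInteger();
        System.out.println(executeWithRetries(() -> attempt(checkouts), 2));
    }
}
```

Here the first checkout fails and the second attempt succeeds, so `main` prints `ok via server-A/conn-2`. Changing the attempt limit is a matter of passing a different `maxAttempts`, with no second copy of the attempt code.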

Changes in this commit represent the outcome of refactoring the code related to the retry logic.
In the new code the retry logic is decoupled from the logic of a retryable operation as much as possible.
Since our operations are quite complex and may decide to break retrying in the middle of an attempt,
e.g., because the server selected in the attempt does not support retries, the operation logic
is still aware of the fact that it may be retried. However, it is important to understand that
this awareness is a direct consequence of the business logic. It cannot be gotten rid of
regardless of the approach taken to structure the code. Operations that have simpler business logic
can be written in a retry-agnostic way without making changes to the retry framework
that was added as part of this commit.
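
As a rough illustration of this awareness, the sketch below lets an operation tell the retry loop to stop retrying in the middle of an attempt. All names (`BreakableRetrySketch`, `RetryContext`, `disallowRetries`, `runCommand`) are hypothetical and are not the API of the retry framework added in this commit.

```java
import java.util.function.Function;

final class BreakableRetrySketch {
    /** Hypothetical state shared between the retry loop and the operation. */
    static final class RetryContext {
        private int attempt = 0;
        private boolean retriesAllowed = true;

        /** Zero-based index of the current attempt. */
        int attempt() {
            return attempt;
        }

        boolean isFirstAttempt() {
            return attempt == 0;
        }

        /** Called by the operation once it learns that retrying makes no sense. */
        void disallowRetries() {
            retriesAllowed = false;
        }
    }

    /** Retries the operation unless it disallowed retries or the attempt limit is reached. */
    static <T> T execute(Function<RetryContext, T> operation, int maxAttempts) {
        RetryContext context = new RetryContext();
        while (true) {
            try {
                return operation.apply(context);
            } catch (RuntimeException e) {
                context.attempt++;
                if (!context.retriesAllowed || context.attempt >= maxAttempts) {
                    throw e;
                }
            }
        }
    }

    /** A retry-aware operation: it fails on the first attempt and may break retrying. */
    static String runCommand(RetryContext context, boolean serverSupportsRetries, int[] calls) {
        calls[0]++;
        if (!serverSupportsRetries) {
            context.disallowRetries(); // decided mid-attempt, after the server is known
        }
        if (context.isFirstAttempt()) {
            throw new IllegalStateException("transient failure");
        }
        return "ok on attempt " + context.attempt();
    }
}
```

An operation with simpler logic can ignore the `RetryContext` argument entirely, which keeps it retry-agnostic while still running under the same loop.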

With the changes in this commit, applying the client-side timeout (CSOT) to a retryable operation
is as simple as replacing the hard retry limit condition with a condition that checks
whether there is time left for attempting the operation again.
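
A minimal sketch of that substitution, assuming the retry decision is expressed as a predicate, could look as follows. The names `RetryConditionSketch`, `attemptLimit` and `deadline` are hypothetical, and a real CSOT implementation would also have to bound the time spent inside each attempt, which this sketch does not do.

```java
import java.util.function.IntPredicate;
import java.util.function.Supplier;

final class RetryConditionSketch {
    /** Retries while mayRetry, given the number of failed attempts so far, allows it. */
    static <T> T execute(Supplier<T> attempt, IntPredicate mayRetry) {
        int failedAttempts = 0;
        while (true) {
            try {
                return attempt.get();
            } catch (RuntimeException e) {
                failedAttempts++;
                if (!mayRetry.test(failedAttempts)) {
                    throw e;
                }
            }
        }
    }

    /** The hard limit: retry only while fewer than maxAttempts attempts have failed. */
    static IntPredicate attemptLimit(int maxAttempts) {
        return failedAttempts -> failedAttempts < maxAttempts;
    }

    /** The time-based condition: retry only while the deadline has not passed. */
    static IntPredicate deadline(long deadlineNanos) {
        // Subtraction keeps the comparison correct even if System.nanoTime() overflows.
        return failedAttempts -> System.nanoTime() - deadlineNanos < 0;
    }
}
```

With this shape, `execute(op, attemptLimit(2))` reproduces the two-attempt behavior, and `execute(op, deadline(System.nanoTime() + timeoutNanos))` retries for as long as time remains, where `timeoutNanos` is whatever budget the caller has.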

JAVA-4034
Branch: master
https://github.com/mongodb/mongo-java-driver/commit/c11600db1156d65bbf1db584111f4778aaccafc4

Comment by Cloud GitHub Webhooks [ 27/Sep/21 ]

stIncMale merged a pull request (JAVA-4034 Refactor read write retries to accommodate for both `PoolClearedError` and CSOT) into the following branch:
master: c11600db1156d65bbf1db584111f4778aaccafc4

Comment by Valentin Kavalenka [ 04/May/21 ]

Currently, the retry logic is hardcoded to allow at most two attempts (one retry), and it sits in the execution order of a command after the points where a server is selected and a connection is checked out for the first attempt. MongoConnectionPoolClearedException may be thrown while checking out a connection for the first attempt, so the retry logic must move earlier in the execution order: before checking out a connection for the first attempt, but after selecting a server for the first attempt. It stays after server selection because all the specifications either explicitly indicate that errors while selecting the server for the first attempt are not retryable, or say nothing about this.

Following are the retryable operations and the relevant specifications:

operation and spec link | retry code location
read | sync, async; see also CommandOperationHelper.isRetryableException
write | sync, async; see also CommandOperationHelper.isRetryableException, CommandOperationHelper.addRetryableWriteErrorLabel
bulk write and bulk API spec | MixedBulkWriteOperation.execute, executeAsync; see also CommandOperationHelper.isRetryableException, CommandOperationHelper.addRetryableWriteErrorLabel
iterate over a change stream cursor (it turned out that change streams create cursors via the normal read operations, which means nothing needs to be done; see wrapped.execute in ChangeStreamOperation.execute and AggregateOperationImpl.execute) | ChangeStreamBatchCursor, AsyncChangeStreamBatchCursor; see also ChangeStreamBatchCursorHelper.isRetryableError

Consider DRIVERS-1570 and DRIVERS-1815 when doing the changes.

Generated at Thu Feb 08 09:01:04 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.