[DRIVERS-746] Drivers should retry operations if connection handshake fails Created: 11/Oct/19  Updated: 12/Dec/22

Status: Implementing
Project: Drivers
Component/s: Retryability
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jeffrey Yemin Assignee: Neal Beeken
Resolution: Unresolved Votes: 6
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Issue split
split to GODRIVER-2191 Drivers should retry operations if co... Blocked
split to CDRIVER-4192 Drivers should retry operations if co... Closed
split to CSHARP-3919 Drivers should retry operations if co... Closed
split to RUBY-2815 Drivers should retry operations if co... Closed
split to RUST-1064 Drivers should retry operations if co... Closed
split to JAVA-4354 Drivers should retry operations if co... Closed
split to MOTOR-836 Drivers should retry operations if co... Closed
split to PYTHON-2951 Drivers should retry operations if co... Closed
split to PHPLIB-1042 Drivers should retry operations if co... Blocked
split to CXX-2393 Drivers should retry operations if co... Closed
split to NODE-3688 Drivers should retry operations if co... Closed
Related
related to DRIVERS-2032 Clarify server pinning behavior and p... Backlog
related to DRIVERS-2489 Improve test coverage for retryable h... Implementing
is related to DRIVERS-1842 Drivers should retry authentication e... Backlog
is related to DRIVERS-1390 Clarify that connection checkout for ... Closed
is related to DRIVERS-2247 Add tests for non-retryable handshake... Backlog
is related to DRIVERS-2140 Clarify Auth Spec and Clean Up Error ... Backlog
Driver Changes: Needed
Quarter: FY23Q1
Downstream Changes Summary:
Case:
Engineering Lead: Jeffrey Yemin
Product Manager: Rachelle Palmer
Driver Compliance:
Key Status/Resolution FixVersion
CDRIVER-4192 Fixed 1.24.0
CXX-2393 Fixed 3.8.0
CSHARP-3919 Fixed 2.15.0
GODRIVER-2191 Blocked
JAVA-4354 Fixed 4.6.0
NODE-3688 Fixed 4.6.0
PHPLIB-1042 Blocked
PYTHON-2951 Fixed 4.1
MOTOR-836 Duplicate
RUBY-2815 Fixed 2.18.0
RUST-1064 Fixed 2.2.0
SWIFT-1378 Fixed 1.4.0

 Description   

This ticket is intended to track the work required to make all drivers resilient to connection handshake network errors with respect to retryable reads and writes.

Currently, the retryable reads and retryable writes specifications do not require that operations be retried in the following case:

  • the driver successfully selects a server for the operation
  • there is no idle connection already available in the pool
  • the driver attempts to open a new connection to the server and complete the connection handshake
  • the connection handshake fails with a network error

Instead, the specifications allow drivers to fail the operation without retrying, even though retrying would be safe in this case for both reads and writes.
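
The scenario above can be sketched as a small simulation. This is an illustration only, not any driver's implementation: `FakePool`, `HandshakeNetworkError`, `run_retryable`, and the single-retry limit are hypothetical names and assumptions introduced here, and a real driver would also repeat server selection before retrying.

```python
class HandshakeNetworkError(Exception):
    """Hypothetical error: network failure while establishing a connection."""


class FakePool:
    """Hypothetical pool with no idle connections; the first N handshakes fail."""

    def __init__(self, failing_handshakes):
        self.failing_handshakes = failing_handshakes
        self.handshake_attempts = 0

    def check_out(self):
        # No idle connection is available, so every checkout opens a new one.
        self.handshake_attempts += 1
        if self.handshake_attempts <= self.failing_handshakes:
            raise HandshakeNetworkError("network error during handshake")
        return "connection"


def run_retryable(pool, operation, max_retries=1):
    """Run operation, retrying when checkout fails before anything is sent."""
    attempt = 0
    while True:
        try:
            conn = pool.check_out()
        except HandshakeNetworkError:
            # Nothing reached the server, so a retry cannot duplicate work.
            if attempt >= max_retries:
                raise
            attempt += 1
            continue
        return operation(conn)
```

With one failing handshake, `run_retryable(FakePool(1), op)` succeeds on the second checkout. With two, the error surfaces after the single retry is exhausted.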

One could read these specifications such that network errors during a connection handshake fall under the definition of a retryable error. However, the specifications do not state explicitly that the definition applies to the handshake, multiple drivers do not interpret it that way in practice, and no tests are defined to assert the behavior.

Similarly, the server selection spec states:

After a server is selected, several error conditions could still occur that make the selected server unsuitable for sending the operation, such as:

  • the server could have shutdown the socket (e.g. a primary stepping down),
  • a connection pool could be empty, requiring new connections; those connections could fail to connect or could fail the server handshake

This specification does not require nor prohibit drivers from attempting automatic recovery for various cases where it might be considered reasonable to do so, such as:

  • repeating server selection if, after selection, a socket is determined to be unsuitable before a message is sent on it

Note, however, that this is not a MUST in the specification, so in practice drivers differ in their behavior.
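
The optional recovery quoted above, repeating server selection when a socket proves unsuitable before a message is sent on it, might look roughly like the following. This is a sketch under assumptions: `UnsuitableSocketError`, `select_and_send`, the callback parameters, and the `max_selections` limit are all hypothetical and are not taken from the server selection spec.

```python
class UnsuitableSocketError(Exception):
    """Hypothetical error: socket found unusable before any message was sent."""


def select_and_send(select_server, check_out, send, max_selections=2):
    """Repeat server selection while failures happen before the send."""
    if max_selections < 1:
        raise ValueError("max_selections must be at least 1")
    last_error = None
    for _ in range(max_selections):
        server = select_server()
        try:
            conn = check_out(server)
        except UnsuitableSocketError as exc:
            # The message never left the driver; selecting again is safe.
            last_error = exc
            continue
        # Errors raised by send() are deliberately not handled here,
        # because the message may already have reached the server.
        return send(conn)
    raise last_error
```

The design point is the boundary: only failures that occur before `send` is called trigger another selection, which matches the "before a message is sent on it" condition in the quoted text.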



 Comments   
Comment by Jeremy Mikola [ 31/Aug/22 ]

Updated downstream changes to add missing commit hash.

Comment by Githook User [ 23/Feb/22 ]

Author: Neal Beeken <neal.beeken@mongodb.com> (nbbeeken)

Message: DRIVERS-746: Make handshake errors retryable (#1115)
Branch: master
https://github.com/mongodb/specifications/commit/082306075eb2d125c47503e94fabc42f00a16784

Comment by Patrick Freed [ 05/Jan/21 ]

Following up on the discussion in DRIVERS-1483: this ticket would be really nice for DRIVERS-781, since we now eagerly evict threads from the WaitQueue with an error. Since these threads would not have even attempted connection establishment when they get this error, they certainly could be retried, assuming we treat "connection handshake fails" from the title as "connection check out fails".

If we could prioritize this such that drivers will implement it before they release any of their DRIVERS-781 changes, that would be great. If that won't be possible, please let me know so I can make the appropriate adjustments for DRIVERS-781, though ideally this could be done all at once.

cc jmikola isabel.atkinson shane.harvey

Generated at Thu Feb 08 08:22:16 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.