[SERVER-72055] NetworkInterfaceTL should always return a retryable error when it times out waiting to acquire a connection Created: 12/Dec/22  Updated: 05/Feb/24

Status: In Code Review
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: George Wangensteen Assignee: Alex Li
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-80020 The exhaustiveFindOnConfig() method s... Backlog
related to SERVER-79771 Make Resharding Operation Resilient t... Closed
Assigned Teams:
Service Arch
Operating System: ALL
Sprint: Service Arch 2023-04-17, Service Arch 2023-05-01, Service Arch 2023-10-16, Service Arch 2023-10-30, Service Arch 2023-11-13, Service Arch 2023-11-27, Service Arch 2023-12-11, Service Arch 2023-12-25, Service Arch 2024-01-08, Service Arch 2024-01-22, Service Arch 2024-02-05, Service Arch 2024-02-19
Participants:
Linked BF Score: 19

 Description   

When a user calls the NetworkInterfaceTL::startCommand API, they provide a target or list of targets as HostAndPort(s) to send a command to. The NetworkInterface then attempts to acquire a connection to that HostAndPort to send the command. The connection pool used by the NetworkInterface enforces an internal timeout on how long a request can wait to acquire a connection (https://github.com/mongodb/mongo/blob/19ee1cedb8a1fbfc6c7a73db5211a592cc96b642/src/mongo/executor/connection_pool.cpp#L778); if that timeout is exceeded before a connection is acquired, the request fails with a NetworkInterfaceExceededTimeLimit error.

 

Importantly, this error (https://github.com/mongodb/mongo/blob/19ee1cedb8a1fbfc6c7a73db5211a592cc96b642/src/mongo/base/error_codes.yml#L247) is not in the error categories Retryable or NetworkError or NotPrimary. This means that higher-level APIs like the Shard class or the AsyncRequestSender will not retry on these NetworkInterfaceExceededTimeLimit errors. 
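
To make the category check concrete, here is a minimal, self-contained sketch of how a higher-level caller might decide whether to retry. The `Code` and `Category` enums and the `isA` and `shouldRetry` functions are hypothetical stand-ins written for this illustration; they are not the server's real ErrorCodes API, and the category table only mirrors what this ticket states.

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical, simplified stand-ins for illustration only.
enum class Code { OK, HostUnreachable, NetworkInterfaceExceededTimeLimit };
enum class Category { Retryable, NetworkError, NotPrimary };

// Toy category table, following the ticket's description.
bool isA(Code code, Category cat) {
    switch (code) {
        case Code::HostUnreachable:
            // Described in the ticket as a network error that higher levels can retry.
            return cat == Category::NetworkError || cat == Category::Retryable;
        case Code::NetworkInterfaceExceededTimeLimit:
            // Per the ticket: in none of the three categories.
            return false;
        case Code::OK:
            return false;
    }
    return false;
}

// A caller that retries only on errors in one of the three categories.
bool shouldRetry(Code code) {
    assert(code != Code::OK);  // only meaningful for failures
    for (Category cat : {Category::Retryable, Category::NetworkError, Category::NotPrimary}) {
        if (isA(code, cat)) {
            return true;
        }
    }
    return false;
}
```

Under these assumptions, `shouldRetry(Code::HostUnreachable)` is true while `shouldRetry(Code::NetworkInterfaceExceededTimeLimit)` is false, which is the behavior this ticket describes for the Shard class and the AsyncRequestSender.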

This is an issue when we fail to acquire connections to a host because it is down, but we could re-target to, e.g., a new primary and successfully complete the command. There is essentially a race condition: if the SDAM system/RSM learns that the host is down first and fails pending connection requests with a network error like HostUnreachable, NetworkInterfaceTL requests fail with an error that higher levels can retry. But if the connection pool's internal timeout fires first and fails requests with NetworkInterfaceExceededTimeLimit, the command can't be retried.
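
The race described above can be modeled as "whichever event fires first decides the error the caller sees". The sketch below is a deterministic toy model, not server code: `errorSeenByCaller` and its two parameters are hypothetical names, and treating a tie as a pool timeout is an arbitrary choice made for the example.

```cpp
#include <cassert>
#include <chrono>

using std::chrono::milliseconds;

// Hypothetical stand-in for the two outcomes discussed in the ticket.
enum class Code { HostUnreachable, NetworkInterfaceExceededTimeLimit };

// sdamNoticesDownAt: when topology monitoring marks the host down and fails
//                    pending connection requests with a network error.
// poolTimeoutAt:     when the connection pool's internal acquisition timeout fires.
// Both are measured from the moment the request started waiting.
Code errorSeenByCaller(milliseconds sdamNoticesDownAt, milliseconds poolTimeoutAt) {
    assert(sdamNoticesDownAt.count() >= 0 && poolTimeoutAt.count() >= 0);
    // A tie is treated as the pool timeout winning (arbitrary for this sketch).
    return sdamNoticesDownAt < poolTimeoutAt ? Code::HostUnreachable
                                             : Code::NetworkInterfaceExceededTimeLimit;
}
```

The same host outage therefore yields a retryable or a non-retryable failure depending only on timing, which is why the outcome is hard to predict from the caller's side.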

This can cause commands that should succeed (upon retry) to fail, and it leaks an unhelpful internal timeout error, which drivers do not know how to retry, back to the user application.

To fix this, we should ensure that the NetworkInterface returns a helpful error that allows for retries when it can't reach a HostAndPort target for potentially transient reasons. We should also consider the purpose of this internal timeout: what value it provides, and what its firing should mean for higher-level APIs using the NetworkInterface.
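
One possible shape of such a fix, sketched under stated assumptions: translate the acquisition timeout into a retryable error at the NetworkInterface boundary so that a caller's re-targeting loop can move on to another host. The ticket does not decide which error should be returned; mapping to HostUnreachable here is purely illustrative, and `translateAcquisitionError`, `isRetryable`, and `runWithRetargeting` are hypothetical helpers, not existing server functions.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only.
enum class Code { OK, HostUnreachable, NetworkInterfaceExceededTimeLimit };

// Assumed boundary translation: surface an acquisition timeout as a
// retryable error. HostUnreachable is chosen only for this example.
Code translateAcquisitionError(Code fromPool) {
    return fromPool == Code::NetworkInterfaceExceededTimeLimit ? Code::HostUnreachable
                                                               : fromPool;
}

bool isRetryable(Code code) {
    return code == Code::HostUnreachable;
}

// Tries each candidate target in order, moving to the next one only when the
// previous attempt failed with a retryable error.
Code runWithRetargeting(const std::vector<std::string>& targets,
                        const std::function<Code(const std::string&)>& sendTo) {
    assert(!targets.empty());
    Code last = Code::OK;
    for (const auto& host : targets) {
        last = translateAcquisitionError(sendTo(host));
        if (last == Code::OK || !isRetryable(last)) {
            return last;
        }
    }
    return last;  // every target failed with a retryable error
}
```

With this translation in place, a command whose first target times out during connection acquisition can still complete against a second target. Without it, the loop would return NetworkInterfaceExceededTimeLimit after the first attempt.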


Generated at Thu Feb 08 06:20:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.