- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- None
- Service Arch
- Minor Change
- ALL
- Service Arch 2023-04-17, Service Arch 2023-05-01, Service Arch 2023-10-16, Service Arch 2023-10-30, Service Arch 2023-11-13, Service Arch 2023-11-27, Service Arch 2023-12-11, Service Arch 2023-12-25, Service Arch 2024-01-08, Service Arch 2024-01-22, Service Arch 2024-02-05, Service Arch 2024-02-19
- 21
When a caller uses the NetworkInterfaceTL::startCommand API, it provides a target or list of targets, as HostAndPort(s), to send a command to. The NetworkInterface then attempts to acquire a connection to that HostAndPort to send the command. The connection pool used by the NetworkInterface enforces an internal timeout on how long a request can wait to acquire a connection (https://github.com/mongodb/mongo/blob/19ee1cedb8a1fbfc6c7a73db5211a592cc96b642/src/mongo/executor/connection_pool.cpp#L778); if that timeout is exceeded before a connection is acquired, the request fails with a NetworkInterfaceExceededTimeLimit error.
Importantly, this error (https://github.com/mongodb/mongo/blob/19ee1cedb8a1fbfc6c7a73db5211a592cc96b642/src/mongo/base/error_codes.yml#L247) is not in any of the error categories Retryable, NetworkError, or NotPrimary. This means that higher-level APIs like the Shard class or the AsyncRequestSender will not retry on NetworkInterfaceExceededTimeLimit errors.
This is an issue when we fail to acquire connections to a host because it is down, but could re-target (e.g., to a new primary) and successfully complete the command. There is essentially a race condition: if the SDAM system/RSM learns that the host is down first and fails pending connection requests with a network error like HostUnreachable, NetworkInterfaceTL requests fail with an error that higher levels can retry. But if the connection pool's internal timeout fires first and fails requests with NetworkInterfaceExceededTimeLimit, the command can't be retried.
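To make the race concrete, here is a minimal, self-contained C++ sketch. It is not the server's code: the `Code` enum, `isRetryable`, and `runWithRetargeting` are hypothetical stand-ins for the real error codes, category checks, and higher-level retry logic (Shard, AsyncRequestSender). It only assumes what the description above states: HostUnreachable is retryable and NetworkInterfaceExceededTimeLimit is not.

```cpp
#include <functional>
#include <string>
#include <vector>

// Simplified stand-in for the server's error codes (not the real ErrorCodes enum).
enum class Code { OK, HostUnreachable, NetworkInterfaceExceededTimeLimit };

// Hypothetical category check mirroring the ticket: a network error such as
// HostUnreachable may be retried; the pool's internal timeout may not.
bool isRetryable(Code c) {
    return c == Code::HostUnreachable;
}

// Hypothetical retry loop: try each candidate target in order and move on to
// the next one only when the previous failure is retryable.
Code runWithRetargeting(const std::vector<std::string>& targets,
                        const std::function<Code(const std::string&)>& send) {
    // With no targets at all, report the command as unreachable.
    Code last = Code::HostUnreachable;
    for (const auto& target : targets) {
        last = send(target);
        if (last == Code::OK || !isRetryable(last)) {
            return last;  // success, or a failure the caller must surface as-is
        }
    }
    return last;
}
```

In this model, if the down host fails with HostUnreachable, the loop reaches the second target and the command succeeds. If the same host fails with NetworkInterfaceExceededTimeLimit, the loop stops at the first target and the internal timeout is returned to the caller, even though a healthy target was available.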
This can cause commands that would succeed upon retry to fail, and it leaks an unhelpful internal timeout error, which drivers do not know how to retry, back to the user application.
To fix this, we should ensure that the NetworkInterface returns a helpful error that allows for retries when it can't reach a HostAndPort target for potentially transient reasons. We should also consider the purpose of this internal timeout, what value it provides, and what its firing should mean contractually for higher-level APIs using the NetworkInterface.
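One possible shape for such a fix is sketched below, purely as an illustration and not as the change that was actually made. The sketch assumes the NetworkInterface can tell whether the caller's own deadline has expired; `AcquireError`, `AcquireFailure`, and `toCallerVisibleCode` are hypothetical names, and mapping to HostUnreachable is just one candidate for a retryable error.

```cpp
// Simplified stand-in for the two error codes discussed in this ticket.
enum class AcquireError { HostUnreachable, NetworkInterfaceExceededTimeLimit };

// Hypothetical description of a failed connection acquisition.
struct AcquireFailure {
    AcquireError code;
    // True when the caller's own deadline ran out, as opposed to the
    // connection pool's internal timer.
    bool callerDeadlineExpired;
};

// Hypothetical translation step: when only the pool's internal timeout fired,
// report a retryable network error so higher levels may re-target. A genuine
// caller deadline expiry is passed through unchanged.
AcquireError toCallerVisibleCode(const AcquireFailure& failure) {
    if (failure.code == AcquireError::NetworkInterfaceExceededTimeLimit &&
        !failure.callerDeadlineExpired) {
        return AcquireError::HostUnreachable;
    }
    return failure.code;
}
```

The design question raised above remains open in this sketch: whether the internal timeout should map onto an existing retryable code or get its own dedicated one depends on the contract the NetworkInterface wants to offer higher-level APIs.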
Related to:
- SERVER-92917 Fix retryability of connection pool timeout on versions earlier than v8.0 (Open)
- SERVER-80020 The exhaustiveFindOnConfig() method should retry on NetworkInterfaceExceededTimeLimit errors (Backlog)
- SERVER-90622 revisit handling of NetworkInterfaceExceededTimeLimit errors in sharding code (Blocked)
- SERVER-79771 Make Resharding Operation Resilient to NetworkInterfaceExceededTimeLimit (Closed)