Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.0.0-rc0
Affects Version/s: None
Component/s: None
Labels: None
Assigned Teams: Service Arch
Backwards Compatibility: Minor Change
Operating System: ALL
Sprint: Service Arch 2023-04-17, Service Arch 2023-05-01, Service Arch 2023-10-16, Service Arch 2023-10-30, Service Arch 2023-11-13, Service Arch 2023-11-27, Service Arch 2023-12-11, Service Arch 2023-12-25, Service Arch 2024-01-08, Service Arch 2024-01-22, Service Arch 2024-02-05, Service Arch 2024-02-19
Case:
Linked BF Score: 21
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

When a user calls the NetworkInterfaceTL::startCommand API, they provide a target or list of targets as HostAndPort(s) to send a command to. The NetworkInterface then attempts to acquire a connection to that HostAndPort to send the command. The connection pool used by the NetworkInterface enforces an internal timeout on how long a request can wait to acquire a connection (https://github.com/mongodb/mongo/blob/19ee1cedb8a1fbfc6c7a73db5211a592cc96b642/src/mongo/executor/connection_pool.cpp#L778); if that timeout is exceeded before a connection is acquired, the request fails with a NetworkInterfaceExceededTimeLimit error.

Importantly, this error (https://github.com/mongodb/mongo/blob/19ee1cedb8a1fbfc6c7a73db5211a592cc96b642/src/mongo/base/error_codes.yml#L247) is not in any of the error categories Retryable, NetworkError, or NotPrimary. This means that higher-level APIs like the Shard class or the AsyncRequestSender will not retry on NetworkInterfaceExceededTimeLimit errors.
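As a rough illustration of why the category membership matters, the sketch below models a caller that retries only on the three categories named above. Everything in it (`Category`, `ErrorInfo`, `wouldRetry`, and the two factory functions) is a hypothetical stand-in rather than the server's actual error-code API, and `HostUnreachable` is assumed to be a NetworkError only because the description calls it a network error.

```cpp
#include <set>
#include <string>

// Hypothetical, simplified stand-ins; this is NOT the real mongo::ErrorCodes API.
enum class Category { Retryable, NetworkError, NotPrimary };

struct ErrorInfo {
    std::string name;
    std::set<Category> categories;
};

// Models a higher-level caller that retries only when the error belongs to
// one of the three categories named in the description.
bool wouldRetry(const ErrorInfo& err) {
    return err.categories.count(Category::Retryable) > 0 ||
        err.categories.count(Category::NetworkError) > 0 ||
        err.categories.count(Category::NotPrimary) > 0;
}

// Assumption: HostUnreachable is modeled as a NetworkError.
ErrorInfo hostUnreachable() {
    return {"HostUnreachable", {Category::NetworkError}};
}

// Per the description, this error is in none of the three categories.
ErrorInfo networkInterfaceExceededTimeLimit() {
    return {"NetworkInterfaceExceededTimeLimit", {}};
}
```

Under this model, `wouldRetry(hostUnreachable())` is true while `wouldRetry(networkInterfaceExceededTimeLimit())` is false, which is the asymmetry the ticket describes.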

This is an issue when we fail to acquire connections to a host because it is down, but we may be able to re-target to, e.g., a new primary and successfully complete the command. There is essentially a race condition: if the SDAM system/RSM learns that the host is down and fails pending connection requests with a network error like HostUnreachable, NetworkInterfaceTL requests will fail with an error that higher levels can retry. But if the connection pool's internal timeout fires first and fails requests with NetworkInterfaceExceededTimeLimit, the command can't be retried.
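The race can be pictured with a small model in which whichever event happens first decides the error a pending request fails with. The function name, its parameters, and the tie-breaking choice are assumptions made for illustration; the real connection pool and monitoring code are considerably more involved.

```cpp
#include <chrono>
#include <string>

using Millis = std::chrono::milliseconds;

// Hypothetical model of the race: the earlier of the two events determines
// which error a request still waiting for a connection is failed with.
//   hostMarkedDownAt: when monitoring learns that the host is down
//   poolTimeoutAt:    when the pool's internal acquisition timeout fires
// Assumption: on an exact tie the pool timeout wins.
std::string errorForPendingRequest(Millis hostMarkedDownAt, Millis poolTimeoutAt) {
    if (hostMarkedDownAt < poolTimeoutAt) {
        return "HostUnreachable";  // higher levels can retry this
    }
    return "NetworkInterfaceExceededTimeLimit";  // higher levels do not retry this
}
```

The same unreachable host therefore produces a retryable or a non-retryable failure depending only on timing.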

This can cause commands that would succeed on retry to fail, and it leaks an unhelpful internal timeout error, which drivers do not know how to retry, back to the user application.

To fix this, we should ensure that the NetworkInterface returns a helpful error that allows for retries when it can't reach a HostAndPort target for potentially transient reasons. We should also consider the purpose of this internal timeout, what value it provides, and what contract its firing should imply for higher-level APIs using the NetworkInterface.
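One possible shape of the proposed behavior is sketched below: a connection-acquisition timeout is treated as retryable, so a caller can move on to another target. This is an illustration under stated assumptions, not the change that was made for this ticket; `Outcome`, `isRetryable`, and `runWithRetargeting` are hypothetical names, and targets are plain strings rather than HostAndPort objects.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical outcomes of sending a command to a single target.
enum class Outcome { Ok, HostUnreachable, ConnectionAcquisitionTimedOut, OtherError };

// Assumption: the acquisition timeout is classified as retryable,
// just like HostUnreachable.
bool isRetryable(Outcome o) {
    return o == Outcome::HostUnreachable || o == Outcome::ConnectionAcquisitionTimedOut;
}

// Tries each target in order. Returns the target that succeeded, or
// std::nullopt if a non-retryable error occurs or every target fails.
std::optional<std::string> runWithRetargeting(
    const std::vector<std::string>& targets,
    const std::function<Outcome(const std::string&)>& send) {
    for (const auto& target : targets) {
        const Outcome result = send(target);
        if (result == Outcome::Ok) {
            return target;
        }
        if (!isRetryable(result)) {
            return std::nullopt;  // surface non-retryable errors immediately
        }
    }
    return std::nullopt;  // all targets failed with retryable errors
}
```

With targets `down:27017` and `up:27017`, where the first times out acquiring a connection and the second succeeds, the call returns `up:27017` instead of failing on the first timeout.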

related to

SERVER-92917 Fix retryability of connection pool timeout on versions earlier than v8.0 (Open)
SERVER-80020 The exhaustiveFindOnConfig() method should retry on NetworkInterfaceExceededTimeLimit errors (Backlog)
SERVER-90622 revisit handling of NetworkInterfaceExceededTimeLimit errors in sharding code (Backlog)
SERVER-79771 Make Resharding Operation Resilient to NetworkInterfaceExceededTimeLimit (Closed)

Assignee: Alex Li
Reporter: George Wangensteen (Inactive)
Participants: Alex Li, George Wangensteen, Githook User
Votes: 0
Watchers: 15

Created: Dec 12 2022 07:51:52 PM UTC
Updated: Jun 06 2025 08:14:03 AM UTC
Resolved: Feb 09 2024 05:00:12 PM UTC
Confidence Status Last Update: 10/Oct/23 12:38 PM
