Core Server / SERVER-72055

NetworkInterfaceTL should by default return a retryable error when it times out waiting to acquire a connection

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 8.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • None
    • Service Arch
    • Minor Change
    • ALL
    • Sprint: Service Arch 2023-04-17, Service Arch 2023-05-01, Service Arch 2023-10-16, Service Arch 2023-10-30, Service Arch 2023-11-13, Service Arch 2023-11-27, Service Arch 2023-12-11, Service Arch 2023-12-25, Service Arch 2024-01-08, Service Arch 2024-01-22, Service Arch 2024-02-05, Service Arch 2024-02-19
    • 21

      A caller of the NetworkInterfaceTL::startCommand API provides a target or list of targets, as HostAndPort(s), to send a command to. The NetworkInterface then attempts to acquire a connection to that HostAndPort to send the command. The connection pool used by the NetworkInterface enforces an internal timeout on how long a request can wait to acquire a connection (https://github.com/mongodb/mongo/blob/19ee1cedb8a1fbfc6c7a73db5211a592cc96b642/src/mongo/executor/connection_pool.cpp#L778); if that timeout is exceeded before a connection is acquired, the request fails with a NetworkInterfaceExceededTimeLimit error.

       

      Importantly, this error (https://github.com/mongodb/mongo/blob/19ee1cedb8a1fbfc6c7a73db5211a592cc96b642/src/mongo/base/error_codes.yml#L247) is not in the error categories Retryable or NetworkError or NotPrimary. This means that higher-level APIs like the Shard class or the AsyncRequestSender will not retry on these NetworkInterfaceExceededTimeLimit errors. 
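
      To illustrate the retry decision described above, here is a minimal, self-contained sketch. The enum and the isRetryable function are hypothetical stand-ins, not the server's actual definitions (those are generated from error_codes.yml); category membership here is taken only from this ticket's description.

```cpp
// Hypothetical, simplified stand-ins for two of the error codes named in this
// ticket. These are NOT the server's real types.
enum class ErrorCode {
    HostUnreachable,                    // a network error, per this ticket
    NetworkInterfaceExceededTimeLimit,  // in none of the retryable categories
};

// Assumed model of how a higher-level caller decides whether to retry:
// it retries only if the code belongs to a category it treats as retryable.
bool isRetryable(ErrorCode code) {
    switch (code) {
        case ErrorCode::HostUnreachable:
            return true;
        case ErrorCode::NetworkInterfaceExceededTimeLimit:
            return false;
    }
    return false;  // unknown codes are conservatively not retried
}
```

      Under this model, a caller that sees HostUnreachable would retry, while one that sees NetworkInterfaceExceededTimeLimit would give up and surface the error.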

      This is an issue when we fail to acquire connections to a host because it is down, but we may be able to re-target to, e.g., a new primary and successfully complete the command. There is essentially a race condition: if the SDAM system/RSM learns that the host is down and fails pending requests for connections with a network error like HostUnreachable, NetworkInterfaceTL requests will fail with an error that higher levels can retry. But if the connection pool's internal timeout fires first and fails requests with NetworkInterfaceExceededTimeLimit, the command can't be retried.
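
      The race can be modeled as "whichever event fires first decides the error". The sketch below is only that model, under the assumption that both events can be reduced to a single delay; the function name and parameters are hypothetical and do not correspond to real server code.

```cpp
#include <chrono>

// Hypothetical stand-ins for the two error codes involved in the race.
enum class ErrorCode { HostUnreachable, NetworkInterfaceExceededTimeLimit };

// Returns the error a pending connection request would fail with, assuming:
//  - topologyDetectsDownAfter: how long until SDAM/RSM learns the host is down
//  - poolAcquireTimeout: the connection pool's internal acquisition timeout
// Whichever delay is shorter wins the race.
ErrorCode errorForPendingRequest(std::chrono::milliseconds topologyDetectsDownAfter,
                                 std::chrono::milliseconds poolAcquireTimeout) {
    return topologyDetectsDownAfter < poolAcquireTimeout
        ? ErrorCode::HostUnreachable                     // higher levels can retry
        : ErrorCode::NetworkInterfaceExceededTimeLimit;  // higher levels cannot
}
```

      The same down host therefore produces a retryable or a non-retryable failure depending only on timing, which is the behavior this ticket objects to.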

      This can cause commands that should succeed (upon retry) to fail, and can leak an unhelpful internal timeout error, which drivers do not know how to retry, back to the user application.

      To fix, we should ensure that the NetworkInterface returns a helpful error that allows for retries when it can't reach a HostAndPort target for potentially transient reasons. We should also consider the purpose of this internal timeout: what value it provides, and what its firing should mean, as a contract, for higher-level APIs using the NetworkInterface.
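
      As a rough sketch of why the error's classification matters to callers, the following models a higher-level API that re-targets across candidate hosts. Everything here (Status, SendFn, RetryPolicy, runWithRetargeting, the host names in the usage note) is hypothetical and chosen for illustration; it is not the Shard or AsyncRequestSender implementation, and it does not describe the fix that was actually made.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical result type for one attempt against one host.
enum class Status { OK, HostUnreachable, NetworkInterfaceExceededTimeLimit };

using SendFn = std::function<Status(const std::string& host)>;
using RetryPolicy = std::function<bool(Status)>;

// Tries each candidate host in order. It moves on to the next host only when
// the policy classifies the failure as retryable; otherwise it returns the
// failure to the caller immediately. With no hosts, it reports HostUnreachable.
Status runWithRetargeting(const std::vector<std::string>& hosts,
                          const SendFn& send,
                          const RetryPolicy& isRetryable) {
    Status last = Status::HostUnreachable;
    for (const auto& host : hosts) {
        last = send(host);
        if (last == Status::OK || !isRetryable(last)) {
            return last;
        }
    }
    return last;
}
```

      For example, suppose the first host ("old-primary:27017") times out acquiring a connection and the second ("new-primary:27017") would succeed. With a policy that treats only HostUnreachable as retryable, the caller gets NetworkInterfaceExceededTimeLimit back; with a policy that also treats the acquisition timeout as retryable, the command completes against the second host.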

            Assignee:
            alex.li@mongodb.com Alex Li
            Reporter:
            george.wangensteen@mongodb.com George Wangensteen
            Votes:
            0
            Watchers:
            14

              Created:
              Updated:
              Resolved: