Monitoring connection is not closed when a server is marked Unknown from an application network error


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Unknown
    • Affects Version/s: None
    • Component/s: SDAM
    • Ruby Drivers

      The Server Monitoring spec requires ("hello or legacy hello Cancellation", server-monitoring.md):

      "When a client marks a server Unknown from Network error when reading or writing, clients MUST cancel the hello or legacy hello check on that server and close the current monitoring connection."

      The rationale ("Drivers cancel in-progress monitor checks") is that the spec assumes a non-timeout network error on one application connection implies that all other connections to that server are also bad. Continuing to read on the current monitoring connection is therefore redundant, and the next check must run on a freshly established connection.

      The Ruby driver implements this only partially. When an application operation fails with a network error, Server::Connection#handle_errors calls unknown!(stop_push_monitor: true), and Server#unknown! only stops the PushMonitor (lib/mongo/server.rb, monitor&.stop_push_monitor!). The polling Monitor's own connection is never touched:

      • Streaming mode: the streaming (awaited hello) connection is closed, but the polling Monitor keeps its long-lived connection and the recovery check reuses it (Monitor#check dispatches on @connection when present, lib/mongo/server/monitor.rb). The server can therefore transition back to a known state without the driver ever proving it can establish a new connection.
      • Polling mode (server_monitoring_mode: :poll, or :auto on FaaS): there is no PushMonitor, so nothing is closed and no in-progress check is cancelled at all.
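
      The two cases above can be illustrated with a toy model. This is only a sketch of the behavior as described in this ticket, not the driver's code: SketchConnection, SketchPushMonitor, SketchMonitor, SketchServer and polling_connection_reused? are hypothetical names, and only the unknown!(stop_push_monitor: true) call shape is taken from the description.

```ruby
# Toy model only: every class here is a hypothetical stand-in, not driver code.
class SketchConnection
  def initialize
    @closed = false
  end

  def close
    @closed = true
  end

  def closed?
    @closed
  end
end

class SketchPushMonitor
  attr_reader :connection

  def initialize
    @connection = SketchConnection.new
  end

  # Stopping the push monitor closes only the streaming connection.
  def stop!
    @connection.close
  end
end

class SketchMonitor
  attr_reader :connection, :push_monitor

  def initialize(streaming:)
    @connection = SketchConnection.new
    # In polling mode there is no push monitor at all.
    @push_monitor = streaming ? SketchPushMonitor.new : nil
  end

  def stop_push_monitor!
    @push_monitor&.stop!
  end

  # Reuses the long-lived connection when one is present.
  def check
    @connection ||= SketchConnection.new
  end
end

class SketchServer
  attr_reader :monitor

  def initialize(streaming:)
    @monitor = SketchMonitor.new(streaming: streaming)
  end

  # Models the reported behavior: the polling connection is never touched.
  def unknown!(stop_push_monitor: false)
    @monitor.stop_push_monitor! if stop_push_monitor
  end
end

# True when the recovery check runs on the same, still-open polling connection.
def polling_connection_reused?(streaming:)
  server = SketchServer.new(streaming: streaming)
  before = server.monitor.check
  server.unknown!(stop_push_monitor: true)
  after = server.monitor.check
  before.equal?(after) && !after.closed?
end
```

      In this model polling_connection_reused? returns true for both streaming: true and streaming: false, which matches the two bullets: the streaming connection may be closed, but the recovery check never has to establish a new connection.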

      Impact: consider a degraded node that resets established application connections and cannot accept new ones, while the monitor's old socket stays healthy (the spec rationale gives Azure closing idle pool connections as an example; a node with a failed disk is another). A compliant driver closes the monitoring connection, fails to re-establish it, and keeps the server Unknown until the node accepts connections again. The Ruby driver instead re-validates the server over the surviving pre-failure socket and marks it available, so operations keep being routed to a node that cannot serve them. This is a suspected contributing factor in the incident behind RUBY-3890, where the driver kept using an unhealthy node until the customer restarted all clients.

      Fix: on an Unknown transition caused by a network error when reading or writing, in addition to stopping the PushMonitor, cancel any in-progress check and disconnect the polling Monitor's connection so the next check must establish a fresh one (TCP + TLS + handshake), in both streaming and polling modes.
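
      A minimal sketch of the intended effect, under stated assumptions: DegradedNode, FixedMonitor, disconnect! and state_after_recovery_check are hypothetical names invented for illustration, the node is modeled as simply refusing new connections, and the old monitoring socket is assumed to keep working. Cancellation of an in-progress check is not modeled here. This is not a patch against lib/mongo/server/monitor.rb.

```ruby
# Hypothetical model of the proposed fix; none of these names exist in the driver.
class DegradedNode
  attr_accessor :accepting_connections

  def initialize
    @accepting_connections = true
  end

  # Stands in for TCP + TLS + handshake.
  def connect
    raise IOError, 'connection refused' unless @accepting_connections

    Object.new
  end
end

class FixedMonitor
  attr_reader :state

  def initialize(node)
    @node = node
    @connection = nil
    @state = :unknown
  end

  # Reuses the existing connection when present, otherwise establishes one.
  def check
    @connection ||= @node.connect
    @state = :known
  rescue IOError
    @connection = nil
    @state = :unknown
  end

  # Proposed step: drop the monitoring connection so the next check reconnects.
  def disconnect!
    @connection = nil
  end

  def unknown!(disconnect_monitor:)
    @state = :unknown
    disconnect! if disconnect_monitor
  end
end

def state_after_recovery_check(disconnect_monitor:)
  node = DegradedNode.new
  monitor = FixedMonitor.new(node)
  monitor.check                        # healthy node: connection established
  node.accepting_connections = false   # node degrades: new connections refused
  monitor.unknown!(disconnect_monitor: disconnect_monitor)
  monitor.check                        # recovery check
  monitor.state
end
```

      With disconnect_monitor: false the model returns :known, reproducing the behavior reported here; with disconnect_monitor: true it returns :unknown, because the recovery check has to establish a fresh connection and cannot.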

            Assignee:
            Unassigned
            Reporter:
            Dmitry Rybakov