Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 2.25.0
Affects Version/s: None
Component/s: SDAM
Labels:
None

Quarter:
- FY27Q2
Confidence Status:
None

Assigned Teams:

Ruby Drivers

Documentation Changes:
Not Needed
Documentation Changes Summary:


1. What would you like to communicate to the user about this feature?
2. Would you like the user to see examples of the syntax and/or executable code and its output?
3. Which versions of the driver/connector does this apply to?


Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Link:
None
Goal Name(s):
None

The Server Monitoring spec requires ("hello or legacy hello Cancellation", server-monitoring.md):

"When a client marks a server Unknown from Network error when reading or writing, clients MUST cancel the hello or legacy hello check on that server and close the current monitoring connection."

The rationale ("Drivers cancel in-progress monitor checks") is that a non-timeout network error on one application connection implies all other connections to that server are bad, so the next check must run on a freshly established connection.

The Ruby driver implements this only partially. When an application operation fails with a network error, Server::Connection#handle_errors calls unknown!(stop_push_monitor: true), and Server#unknown! only stops the PushMonitor (lib/mongo/server.rb, monitor&.stop_push_monitor!). The polling Monitor's own connection is never touched:

- Streaming mode: the streaming (awaited hello) connection is closed, but the polling Monitor keeps its long-lived connection and the recovery check reuses it (Monitor#check dispatches on @connection when present, lib/mongo/server/monitor.rb). The server can therefore transition back to a known state without the driver ever proving it can establish a new connection.
- Polling mode (server_monitoring_mode: :poll, or :auto on FaaS): there is no PushMonitor, so nothing is closed and no in-progress check is cancelled at all.

Impact: consider a degraded node that resets established application connections and cannot accept new connections, while the monitor's old socket stays healthy (the spec rationale gives Azure closing idle pool connections as an example; a node with a failed disk is another). A compliant driver closes the monitoring connection, fails to re-establish it, and keeps the server Unknown until the node actually accepts connections again. The Ruby driver instead re-validates the server over its surviving pre-failure socket and marks it available again, so operations keep being routed to a node that cannot serve them. This is one of the suspected contributing factors in the incident behind RUBY-3890 (driver kept using an unhealthy node until the customer restarted all clients).
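The scenario above can be illustrated with a small toy model. This is only a sketch under stated assumptions: DegradedNode, ToyMonitor and state_after_recovery_check are hypothetical names invented for illustration and are not driver classes. The model assumes a node that refuses every new connection but still answers hello on a socket opened before the failure.

```ruby
# Toy model only: none of these classes exist in the Ruby driver.
class DegradedNode
  # The node refuses every new connection attempt...
  def accept_connection
    raise IOError, 'connection refused'
  end

  # ...but still answers hello on a socket opened before the failure.
  def hello_on_existing_socket
    { 'ok' => 1 }
  end
end

class ToyMonitor
  attr_reader :state

  def initialize(node, close_on_network_error:)
    @node = node
    @close_on_network_error = close_on_network_error
    @has_old_socket = true # established before the node degraded
    @state = :known
  end

  # Called when an application operation hits a non-timeout network error.
  def mark_unknown_from_network_error
    @state = :unknown
    @has_old_socket = false if @close_on_network_error
  end

  # One monitoring check: reuse the old socket if it survived,
  # otherwise a new connection has to be established first.
  def check
    if @has_old_socket
      @node.hello_on_existing_socket
    else
      @node.accept_connection
    end
    @state = :known
  rescue IOError
    @state = :unknown
  end
end

# Hypothetical helper: state of the server after one recovery check.
def state_after_recovery_check(close_on_network_error:)
  monitor = ToyMonitor.new(DegradedNode.new,
                           close_on_network_error: close_on_network_error)
  monitor.mark_unknown_from_network_error
  monitor.check
  monitor.state
end

puts state_after_recovery_check(close_on_network_error: false) # prints "known"
puts state_after_recovery_check(close_on_network_error: true)  # prints "unknown"
```

With the monitoring socket kept, the toy monitor marks the node known again even though it cannot accept connections. With the socket closed, the node stays Unknown, which is the outcome the spec text asks for.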

Fix: on an Unknown transition caused by a network error when reading or writing, in addition to stopping the PushMonitor, cancel any in-progress check and disconnect the polling Monitor's connection so the next check must establish a fresh one (TCP + TLS + handshake), in both streaming and polling modes.
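One possible shape of that behavior is sketched below. It is not the actual patch: SketchConnection, SketchMonitor, SketchServer, cancel_check_and_disconnect! and the network_error: keyword are all hypothetical names, and a generation counter stands in for real socket interruption. The sketch assumes that a check in flight during the Unknown transition must have its result discarded, and that the next check must open a new connection.

```ruby
# Hypothetical sketch: names are illustrative, not the driver's API.
class SketchConnection
  attr_reader :id

  def initialize(id)
    @id = id
    @closed = false
  end

  def close
    @closed = true
  end

  def closed?
    @closed
  end
end

class SketchMonitor
  attr_reader :connections_opened

  def initialize
    @lock = Mutex.new
    @connection = nil
    @connections_opened = 0
    @generation = 0
  end

  # Runs one check. The block stands in for the hello round trip and
  # receives the connection used. Returns :ok, or :cancelled when
  # cancel_check_and_disconnect! ran while the check was in flight.
  def check
    conn, generation = @lock.synchronize do
      @connection ||= open_connection
      [@connection, @generation]
    end
    yield conn if block_given?
    @lock.synchronize { generation == @generation ? :ok : :cancelled }
  end

  # Invalidates any in-flight check and drops the monitoring connection,
  # so the next check has to establish a fresh one.
  def cancel_check_and_disconnect!
    @lock.synchronize do
      @generation += 1
      @connection&.close
      @connection = nil
    end
  end

  private

  def open_connection
    @connections_opened += 1
    SketchConnection.new(@connections_opened)
  end
end

class SketchServer
  attr_reader :monitor, :state

  def initialize
    @monitor = SketchMonitor.new
    @state = :known
  end

  # network_error: true models "Network error when reading or writing".
  def unknown!(network_error: false)
    @state = :unknown
    @monitor.cancel_check_and_disconnect! if network_error
  end
end

server = SketchServer.new
first = nil
server.monitor.check { |conn| first = conn }
server.unknown!(network_error: true)
second = nil
server.monitor.check { |conn| second = conn }
puts first.closed?         # prints "true"
puts first.id == second.id # prints "false"
```

Because the sketch does not depend on a PushMonitor, the same path would apply in both streaming and polling modes. In a real driver the cancellation would also need to interrupt a blocking socket read, which this toy model does not attempt.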

is related to

RUBY-3911 Handshake network errors are always labeled SystemOverloadedError, so the server is never marked Unknown from connection-establishment failures

Ready for Work

Assignee: Dmitry Rybakov
Reporter: Dmitry Rybakov
Votes: 0
Watchers: 1

Created: Jul 03 2026 02:00:22 PM UTC
Updated: Jul 14 2026 03:30:13 PM UTC
Resolved: Jul 14 2026 03:30:13 PM UTC
