In mongoc_cluster_run_command_internal, and possibly in _mongoc_cluster_stream_for_server, we call mongoc_cluster_disconnect_node when we get a network error, but we don't always call mongoc_topology_invalidate_server. As a result, the server stays marked as available, and we can keep attempting operations on it that fail, without first re-scanning the topology.
The Server Discovery and Monitoring spec says:
"If there is a network timeout on the connection after the handshake completes, the client MUST NOT mark the server Unknown. (A timeout may indicate a slow operation on the server, rather than an unavailable server.) If, however, there is some other network error on the connection after the handshake completes, the client MUST replace the server's description with a default ServerDescription of type Unknown, and fill the ServerDescription's error field with useful information, the same as if an error or timeout occurred before the handshake completed."
Audit all mongoc_cluster_disconnect_node calls and check if they properly call mongoc_topology_invalidate_server after non-timeout network errors. Consider a refactoring to make this mistake less likely. Perhaps add a bool to mongoc_cluster_disconnect_node to tell it to call mongoc_topology_invalidate_server.
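A rough sketch of what the proposed refactoring could look like, using self-contained stand-in types rather than the real driver structs. Every sketch_* name below is hypothetical and exists only for illustration; the actual signatures of mongoc_cluster_disconnect_node and mongoc_topology_invalidate_server are not reproduced here. The idea is that the disconnect helper takes a bool, so each call site has to decide explicitly whether the server should also be marked Unknown:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's internal state; not the real
 * mongoc structs. */
typedef enum {
   SKETCH_SERVER_STANDALONE,
   SKETCH_SERVER_UNKNOWN
} sketch_server_type_t;

typedef enum {
   SKETCH_NET_TIMEOUT, /* network timeout after the handshake */
   SKETCH_NET_OTHER    /* any other post-handshake network error */
} sketch_net_error_t;

typedef struct {
   uint32_t server_id;
   bool connected;
   sketch_server_type_t type;
} sketch_node_t;

/* Plays the role of mongoc_topology_invalidate_server: mark the server
 * Unknown so the next operation triggers a topology re-scan. */
void
sketch_invalidate_server (sketch_node_t *node)
{
   node->type = SKETCH_SERVER_UNKNOWN;
}

/* Plays the role of mongoc_cluster_disconnect_node, with the extra bool
 * suggested in this ticket. */
void
sketch_disconnect_node (sketch_node_t *node, bool invalidate)
{
   assert (node != NULL);
   node->connected = false;
   if (invalidate) {
      sketch_invalidate_server (node);
   }
}

/* Call-site helper: per the SDAM rule quoted above, a timeout must not
 * mark the server Unknown, while any other network error must. */
void
sketch_handle_network_error (sketch_node_t *node, sketch_net_error_t err)
{
   sketch_disconnect_node (node, err != SKETCH_NET_TIMEOUT);
}
```

With this shape, a call site cannot disconnect a node without stating whether to invalidate it, which is the mistake the audit is meant to catch. Whether the real fix should use a bool parameter or a separate wrapper function is left open.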
Related to: CDRIVER-2174 _mongoc_cluster_check_interval() should invalidate nodes after detecting a closed socket (Closed)