[CDRIVER-3654] Pooled handshake does not handle network errors correctly Created: 06/May/20  Updated: 23/Mar/23

Status: Backlog
Project: C Driver
Component/s: libmongoc, network
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Kevin Albertson
Assignee: Unassigned
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

If a network error occurs before the ismaster handshake completes, SDAM says this should invalidate the server description if the connection's generation is valid:

If there is a network error or timeout on the connection before the handshake completes, the client MUST replace the server's description with a default ServerDescription of type Unknown, and fill the ServerDescription's error field with useful information.
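
As a rough illustration of the quoted requirement, the sketch below resets a server description to a default one of type Unknown and records the error. The types and the mark_unknown helper are hypothetical and simplified; they are not libmongoc's definitions.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified types; these are NOT libmongoc's definitions. */
typedef enum { SERVER_UNKNOWN, SERVER_STANDALONE, SERVER_RS_PRIMARY } server_type_t;

typedef struct {
   server_type_t type;
   char error[128]; /* empty string means no error is recorded */
} server_description_t;

/* Replace the description with a default one of type Unknown and record
 * the reason, as the quoted SDAM text requires for pre-handshake errors. */
static void
mark_unknown (server_description_t *sd, const char *reason)
{
   memset (sd, 0, sizeof *sd); /* drop everything previously known */
   sd->type = SERVER_UNKNOWN;
   snprintf (sd->error, sizeof sd->error, "%s", reason);
}
```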

The current behavior deviates from this: a network error during the handshake can invalidate the server more than once.

_mongoc_stream_run_ismaster uses the current server description and runs the ismaster command with mongoc_cluster_run_command_private. That function handles network errors as if they were post-handshake errors (it invalidates the server unless the error is a timeout).

When that error bubbles up to _mongoc_cluster_stream_for_server, it ends up invalidating the server again.

I believe this has been a long-standing issue, and in practice multiple invalidations may not be terribly problematic. Here is one scenario where this could happen.

  • Thread A creates a connection with generation 0.
  • Thread B receives a network error and invalidates the server, incrementing the generation to 1.
  • Thread A begins the handshake, calling _mongoc_stream_run_ismaster, which retrieves the server description with generation 1.
  • Thread A receives a network error when performing the handshake, thinks it has the latest generation (though it really doesn't), and invalidates again.
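
The steps above can be replayed sequentially in a small model. This is only a sketch: server_t, conn_t, invalidate, and handshake_error_unchecked are hypothetical names that stand in for the driver's internals, and the two threads are simulated by ordering the calls by hand.

```c
/* Hypothetical model; names do not correspond to libmongoc internals. */
typedef struct {
   unsigned generation;    /* bumped on every invalidation */
   unsigned invalidations; /* counter kept only for illustration */
} server_t;

typedef struct {
   unsigned generation; /* server generation when the connection was created */
} conn_t;

static void
invalidate (server_t *s)
{
   s->generation++;
   s->invalidations++;
}

/* Models the problematic error path: the generation is re-read from the
 * server at handshake time, so the connection's own generation is ignored. */
static void
handshake_error_unchecked (server_t *s, const conn_t *c)
{
   unsigned seen = s->generation; /* fresh read, not c->generation */
   (void) c;
   if (seen == s->generation) { /* always true */
      invalidate (s);
   }
}

/* Replays the four steps and returns how many invalidations happened. */
static unsigned
replay_scenario (void)
{
   server_t s = {0, 0};
   conn_t a = {s.generation};          /* Thread A: connection at generation 0 */
   invalidate (&s);                    /* Thread B: error, generation becomes 1 */
   handshake_error_unchecked (&s, &a); /* Thread A: handshake error */
   return s.invalidations;
}
```

In this model replay_scenario returns 2: the server is invalidated once by Thread B and again by Thread A, even though Thread A's connection predates the first invalidation.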

The introduction of a connection generation in CDRIVER-3615 should prevent this behavior (only the first invalidation wins). Unfortunately, the server description retrieved by _mongoc_stream_run_ismaster could have a later generation than the one in effect when the stream was created (causing a double invalidation), and mongoc_cluster_run_command_private does not check the generation (so it always invalidates).
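
One way to picture the intended "first invalidation wins" rule is a check against the generation stored on the connection itself. The sketch below is a minimal model under that assumption; server_t, conn_t, and invalidate_if_current are hypothetical names, not libmongoc functions, and a real implementation would also need to hold the appropriate lock around the compare and increment.

```c
#include <stdbool.h>

/* Hypothetical model; names do not correspond to libmongoc internals. */
typedef struct {
   unsigned generation; /* bumped on every invalidation */
} server_t;

typedef struct {
   unsigned generation; /* captured once, when the connection is created */
} conn_t;

/* Invalidate only if the connection was created under the server's current
 * generation. Returns true if this call performed the invalidation. */
static bool
invalidate_if_current (server_t *s, const conn_t *c)
{
   if (c->generation != s->generation) {
      return false; /* stale connection: someone already invalidated */
   }
   s->generation++;
   return true;
}
```

With two connections created at generation 0, the first error invalidates the server and the second is ignored, because its stored generation no longer matches.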

