[CDRIVER-2172] Call "ping" on a socket that has been idle for socketCheckIntervalMS Created: 01/Jun/17  Updated: 28/Oct/23  Resolved: 16/Jun/17

Status: Closed
Project: C Driver
Component/s: libmongoc
Affects Version/s: None
Fix Version/s: 1.7.0

Type: Bug Priority: Major - P3
Reporter: Jeremy Mikola Assignee: A. Jesse Jiryu Davis
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on DRIVERS-390 Call "ping" on a socket that has been... Closed
is depended on by PHPC-1296 Call "ping" on a socket that has been... Closed
Related
related to CDRIVER-2174 _mongoc_cluster_check_interval() shou... Closed
is related to CDRIVER-2075 Retry ismaster calls once Closed

 Description   

Similar to CDRIVER-2075 (retry isMaster calls once), I think it may be worthwhile if _mongoc_cluster_check_interval() made one attempt to reestablish a closed stream before returning an error.

Using the basic case of a standalone topology as an example, I created a basic PHP script that connects to localhost:27017 and issues a ping command and served this through a single-worker web server to isolate our libmongoc client persistence to all requests. When I restarted the mongod server between requests (within the span of two seconds), the socket is left in the CLOSE_WAIT state and the subsequent request encounters a "Stream is closed" error.

socketCheckIntervalMS does not seem relevant here, as the _mongoc_cluster_check_interval() will return a "Stream is closed" error before deciding if socketCheckIntervalMS warrants an isMaster command.

The original test used our default server selection (serverSelectionTryOnce=true). When I switched to serverSelectionTryOnce=false, I did not notice any change in behavior. Is that by design, since libmongoc has technically left the server selection loop at this point? If that is the case, I think it supports the idea of retrying connections in this case.

I did notice that tuning heartbeatFrequencyMS lower to ensure a topology update between restarting mongod and the subsequent PHP request did avoid a "Stream is closed" error being reported to the user.


Note: if this issue is worth implementing, we should probably also revise the SDAM spec accordingly. I can create that ticket if needed.



 Comments   
Comment by Githook User [ 16/Jun/17 ]

Author:

{u'username': u'ajdavis', u'name': u'A. Jesse Jiryu Davis', u'email': u'jesse@mongodb.com'}

Message: CDRIVER-2172 check idle socket with "ping"

Server Selection Spec update for single-threaded drivers.

If a server is selected that has an existing connection that has been idle for
socketCheckIntervalMS, the driver MUST check the connection with the "ping"
command. If the ping succeeds, use the selected connection. If not, set the
server's type to Unknown and update the Topology Description according to the
Server Discovery and Monitoring Spec, and attempt once more to select a server.
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/9f9934dcd9d7de3e7f8db315145106df44edee23

Comment by A. Jesse Jiryu Davis [ 14/Jun/17 ]

We've updated the specs:

https://github.com/mongodb/specifications/commit/4369d5667b663ab07b2928ad04fdec1b56b6c20f

What the driver should do is re-check an idle socket right after server selection. If a server is selected that has an existing connection that's been idle for 5 seconds, the driver should attempt to call "ping", not ismaster, since we're not trying to update the topology. If "ping" succeeds, use the socket. If it failed, set the server's type to Unknown and re-enter server selection once, re-discovering the topology if needed (as SDAM already specifies), then proceed from there.

Comment by Githook User [ 09/Jun/17 ]

Author:

{u'username': u'ajdavis', u'name': u'A. Jesse Jiryu Davis', u'email': u'jesse@mongodb.com'}

Message: CDRIVER-2172 robust mongoc_client_server_select

If a server is selected and its socket has not been used in 5 seconds,
check if the socket is still good and retry once if not.
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/3f1a09b8d1c277df0b7efb79c5fc2acb03bf0d3b

Comment by A. Jesse Jiryu Davis [ 09/Jun/17 ]

I'm reopening this. Depending on how the related Server Discovery And Monitoring Spec change is resolved, this may require more work. At the very least, the logic should move from mongoc_client_select_server into mongoc_topology_select_server so that all single-threaded code paths benefit from the change, not just one function.

Comment by A. Jesse Jiryu Davis [ 06/Jun/17 ]

The PHP driver uses the C driver like:

server = mongoc_client_select_server (...);
do_something_with (server);

The best solution I can think of to make the above logic more resilient is to update mongoc_client_select_server:

for (i = 0; i < 2; i++) {
    sd = topology_select_server(...);
    if (single_threaded) {
        node = mongoc_topology_scanner_get_node (...);
        if (node && node is over 5 seconds old) {
            call ismaster
            if (error) {
              continue;
            }
        }
    }
    return sd;
}
 
return NULL;

Additionally, there's logic in mongoc_cluster_sendv_to_server to check a 1-second-old socket to see if it's closed (which is not in any spec), and check a 5-second-old socket with an ismaster (a misinterpretation of SDAM). If either check fails then mongoc_cluster_sendv_to_server returns an error, which is useless. Remove those checks from mongoc_cluster_sendv_to_server.

Comment by Jeremy Mikola [ 06/Jun/17 ]

Is that by design, since libmongoc has technically left the server selection loop at this point? If that is the case, I think it supports the idea of retrying connections in this case.

This may be problematic if the connection was dropped as a result a replica set failover, where a primary has stepped down to become a secondary. In that case, libmongoc thinks it has selected a primary (based on the last topology state) and after reconnecting it may be a secondary. In this case, we don't want to return a socket for a secondary and should instead report an error.

Comment by Jeremy Mikola [ 01/Jun/17 ]

Another observation: CDRIVER-2075 would allow libmongoc to recover from a dropped connection, which it does not become aware of until isMaster fails. In that case, mongoc_socket_check_closed() has failed to detect that the socket was dropped. In the OP, I'm describing a case where we close a connection between two ports on the local machine, so the OS is always aware that one side has dropped the connection and sets the state as CLOSE_WAIT. Given that, CDRIVER-2075 and this issue may be two sides of the same coin.

Generated at Wed Feb 07 21:14:24 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.