[CDRIVER-2174] _mongoc_cluster_check_interval() should invalidate nodes after detecting a closed socket Created: 02/Jun/17  Updated: 28/Oct/23  Resolved: 18/Jun/17

Status: Closed
Project: C Driver
Component/s: libmongoc
Affects Version/s: 1.6.3
Fix Version/s: 1.7.0

Type: Bug Priority: Major - P3
Reporter: Jeremy Mikola Assignee: A. Jesse Jiryu Davis
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to CDRIVER-2237 Server not invalidated in topology if... Closed
is related to CDRIVER-2172 Call "ping" on a socket that has been... Closed

 Description   

Quoting the SDAM spec on Network error when reading or writing:

If there is a network timeout on the connection after the handshake completes, the client MUST NOT mark the server Unknown. (A timeout may indicate a slow operation on the server, rather than an unavailable server.) If, however, there is some other network error on the connection after the handshake completes, the client MUST replace the server's description with a default ServerDescription of type Unknown, and fill the ServerDescription's error field with useful information, the same as if an error or timeout occurred before the handshake completed.

This doesn't jive with the implementation of _mongoc_cluster_check_interval(). Before the logic for socketCheckIntervalMS (i.e. isMaster for sockets last used more than five seconds ago by default), there is a basic check for a closed socket if the socket was last used more than a second ago (CHECK_CLOSED_DURATION_MSEC is defined as 1000):

if (scanner_node->last_used + (1000 * CHECK_CLOSED_DURATION_MSEC) < now) {
  if (mongoc_stream_check_closed (stream)) {
     mongoc_cluster_disconnect_node (cluster, server_id);
     bson_set_error (error, MONGOC_ERROR_STREAM, MONGOC_ERROR_STREAM_SOCKET,
                     "Stream is closed");
     return false;
  }
}

I would expect mongoc_topology_invalidate_server() to be called here, as it often is in other places following mongoc_cluster_disconnect_node() on socket errors.


I attempted to test this using the following PHP script:

<?php
 
function ping() {
    $manager = new \MongoDB\Driver\Manager('mongodb://127.0.0.1:27017?serverSelectionTryOnce=false&serverSelectionTimeoutMS=6000');
 
    printf("Start time: %.6f\n", $s = microtime(true));
    try {
        $server = $manager->selectServer(new MongoDB\Driver\ReadPreference(MongoDB\Driver\ReadPreference::RP_PRIMARY));
        echo "Selected primary\n";
        $server->executeCommand('admin', new MongoDB\Driver\Command(['ping' => 1]));
        echo "Ping successful\n";
    } catch (\Exception $e) {
        printf("ERROR: %s\n", $e->getMessage());
    }
    printf("End time: %.6f\n", $e = microtime(true));
    printf("Elapsed time: %.6f\n", $e - $s);
    echo "<hr>\n";
}
 
for ($i = 0; $i < 10; $i++) {
    ping();
    // Ensure that we surpass CHECK_CLOSED_DURATION_MSEC
    usleep(1500 * 1000);
}

If I start this script with a running MongoDB server on localhost, and restart it during the test, we get the following output (abridged to the relevant iterations around the restart):

Start time: 1496422389.810717
Selected primary
Ping successful
End time: 1496422389.811190
Elapsed time: 0.000473
 
Start time: 1496422391.311405
Selected primary
ERROR: Stream is closed
End time: 1496422391.311774
Elapsed time: 0.000369
 
Start time: 1496422392.811984
Selected primary
ERROR: Failed to send "ismaster" command with database "admin": Failure during socket delivery: Broken pipe (32)
End time: 1496422392.812394
Elapsed time: 0.000410
 
Start time: 1496422394.312595
Selected primary
Ping successful
End time: 1496422399.315092
Elapsed time: 5.002497

Notice that two exceptions are thrown before we can recover from the socket error. If it's not immediately obvious, recovery occurs during the last iteration, which has a higher elapsed time. Also note that server selection succeeded immediately following the first "Stream is closed" error because the node was never invalidated.

If we revise the original script to use a smaller usleep() interval to ensure we don't surpass CHECK_CLOSED_DURATION_MSEC, we only encounter one exception:

Start time: 1496422345.924555
Selected primary
Ping successful
End time: 1496422345.925055
Elapsed time: 0.000500
 
Start time: 1496422346.425322
Selected primary
ERROR: Failed to read 4 bytes: socket error or timeout
End time: 1496422346.425683
Elapsed time: 0.000361
 
Start time: 1496422346.925905
Selected primary
Ping successful
End time: 1496422351.928534
Elapsed time: 5.002629


I noted that this CHECK_CLOSED_DURATION_MSEC logic is specific to the single-threaded SDAM implementation, but I was unable to correlate it with any section of the SDAM spec. Do you have a reference for it?

That said, I believe we can solve this behavior by adding a mongoc_topology_invalidate_server() call after bson_set_error() and before return false in the mongoc_stream_check_closed() condition cited earlier.



 Comments   
Comment by A. Jesse Jiryu Davis [ 18/Jun/17 ]

Closed along with CDRIVER-2172:

https://github.com/mongodb/mongo-c-driver/commit/9f9934dcd9d7de3e7f8db315145106df44edee23

Comment by A. Jesse Jiryu Davis [ 07/Jun/17 ]

Same in mongoc_cluster_run_command_internal.

Generated at Wed Feb 07 21:14:24 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.