[CDRIVER-610] Client should retry every .5 sec until server selection timeout Created: 08/Apr/15  Updated: 08/Jan/24  Resolved: 28/Apr/15

Status: Closed
Project: C Driver
Component/s: None
Affects Version/s: 1.2.0
Fix Version/s: 1.2-beta0

Type: Bug Priority: Major - P3
Reporter: Hannes Magnusson Assignee: A. Jesse Jiryu Davis
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to CDRIVER-594 Make all of the timeouts and interval... Closed
is related to CDRIVER-2484 Retry logic Closed
is related to PHPC-252 Application freezes for a minute when... Closed

 Description   

*Summary (Jesse)*: this was two bugs in the single-threaded implementation of the Server Discovery And Monitoring and Server Selection specs. First, after failing an initial connection, the client never re-attempted connection. Second, it spun in a tight loop until the server selection timeout (default 30 seconds) expired. The fix is to actually re-attempt a connection after each connection failure (until the timeout), and to pause half a second between attempts (the minHeartbeatFrequencyMS in Server Discovery And Monitoring).
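
A minimal sketch of the fixed behavior described above, not the driver's actual code: scan, try selection again, and pause minHeartbeatFrequencyMS between scans until the server selection timeout expires. The names `sim_t`, `sim_scan`, `sim_sleep`, and `select_server` are hypothetical, and the clock is simulated so the loop can run without a network or real waiting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simulated state: an assumption for illustration, not a libmongoc type. */
typedef struct {
   int64_t now_ms;      /* simulated monotonic clock */
   int scans;           /* number of scans performed so far */
   int succeed_on_scan; /* scan number that finds a server; 0 means never */
} sim_t;

/* Stand-in for one blocking scan that re-attempts the connection. */
static bool
sim_scan (sim_t *sim)
{
   sim->scans++;
   return sim->succeed_on_scan != 0 && sim->scans >= sim->succeed_on_scan;
}

/* Stand-in for sleeping; a real client would actually block here. */
static void
sim_sleep (sim_t *sim, int64_t ms)
{
   sim->now_ms += ms;
}

/* Returns true if a server was selected before the timeout elapsed. */
static bool
select_server (sim_t *sim, int64_t timeout_ms, int64_t min_heartbeat_ms)
{
   int64_t deadline;

   assert (min_heartbeat_ms > 0);
   deadline = sim->now_ms + timeout_ms;

   for (;;) {
      if (sim_scan (sim)) {
         return true; /* the scan discovered a suitable server */
      }
      if (sim->now_ms + min_heartbeat_ms > deadline) {
         return false; /* no time left for another scan */
      }
      /* Pause instead of spinning in a tight loop. */
      sim_sleep (sim, min_heartbeat_ms);
   }
}
```

With a 30000 ms timeout and a 500 ms pause, a run that never finds a server performs 61 scans in this model (one at each half-second from 0 to 30000 ms) rather than spinning continuously.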

*Original Report (Hannes)*:

Straight from the example docs - modifying the port to not-a-mongod will result in an endless loop.

#include <bson.h>
#include <mongoc.h>
#include <stdio.h>
 
int
main (int   argc,
      char *argv[])
{
    mongoc_client_t *client;
    mongoc_collection_t *collection;
    mongoc_cursor_t *cursor;
    bson_error_t error;
    bson_oid_t oid;
    bson_t *doc;
 
    mongoc_init ();
 
    client = mongoc_client_new ("mongodb://localhost:27016/");
    collection = mongoc_client_get_collection (client, "test", "test");
 
    doc = bson_new ();
    bson_oid_init (&oid, NULL);
    BSON_APPEND_OID (doc, "_id", &oid);
    BSON_APPEND_UTF8 (doc, "hello", "world");
 
    if (!mongoc_collection_insert (collection, MONGOC_INSERT_NONE, doc, NULL, &error)) {
        printf ("Insert failed: %s\n", error.message);
    }
 
    bson_destroy (doc);
 
    doc = bson_new ();
    BSON_APPEND_OID (doc, "_id", &oid);
 
    if (!mongoc_collection_delete (collection, MONGOC_DELETE_SINGLE_REMOVE, doc, NULL, &error)) {
        printf ("Delete failed: %s\n", error.message);
    }
 
    bson_destroy (doc);
    mongoc_collection_destroy (collection);
    mongoc_client_destroy (client);
 
    return 0;
}

mongoc1.2.x

vagrant@precise64:~/mongo-c-driver$ time ./uds
Insert failed: Timed out trying to select a server
Delete failed: Timed out trying to select a server
 
real	3m0.011s

mongoc1.1.x produced the correct result:

2015/04/07 18:32:18.0622: [10104]:    DEBUG:      cluster: Client initialized in direct mode.
2015/04/07 18:32:18.0623: [10104]:  WARNING:       client: Failed to connect to: ipv4 127.0.0.1:27016, error: 111, Connection refused
 
Insert failed: Failed to connect to target host: localhost:27016
2015/04/07 18:32:18.0624: [10104]:  WARNING:       client: Failed to connect to: ipv4 127.0.0.1:27016, error: 111, Connection refused
 
Delete failed: Failed to connect to target host: localhost:27016



 Comments   
Comment by Githook User [ 07/Oct/15 ]

Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>

Message: CDRIVER-610 wait minHeartbeatFrequencyMS between blocking scans

Server Selection Spec:

Single-threaded server selection: When a client that uses single-threaded
monitoring fails to select a suitable server for any operation, it scans the
servers, then attempts selection again, to see if the scan discovered suitable
servers. It repeats, waiting minHeartbeatFrequencyMS between scans, until a
timeout.

https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#single-threaded-server-selection
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/38fc65a0f4b2b17d71ee71683c93aa84718a6733

Comment by Githook User [ 07/Oct/15 ]

Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>

Message: CDRIVER-610 reconnect on err in single-thread mode

While looping to select a server, attempt reconnect after error.
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/474dad95b895779dcef2149d862602636f44ba77

Comment by Githook User [ 07/Oct/15 ]

Author: Jason Carey (hanumantmk) <jcarey@argv.me>

Message: CDRIVER-610 remove retries in server selection

There's no need to perform retries in server selection. The topology
select takes care of requesting a scan and waiting for the allotted
timeout.
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/21f69173adf793bc60e8dac8197fd5ddbc18c9cb

Comment by Githook User [ 28/Apr/15 ]

Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>

Message: CDRIVER-610 wait minHeartbeatFrequencyMS between blocking scans

Server Selection Spec:

Single-threaded server selection: When a client that uses single-threaded
monitoring fails to select a suitable server for any operation, it scans the
servers, then attempts selection again, to see if the scan discovered suitable
servers. It repeats, waiting minHeartbeatFrequencyMS between scans, until a
timeout.

https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#single-threaded-server-selection
Branch: 1.2.0-dev
https://github.com/mongodb/mongo-c-driver/commit/38fc65a0f4b2b17d71ee71683c93aa84718a6733

Comment by Githook User [ 28/Apr/15 ]

Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>

Message: CDRIVER-610 reconnect on err in single-thread mode

While looping to select a server, attempt reconnect after error.
Branch: 1.2.0-dev
https://github.com/mongodb/mongo-c-driver/commit/474dad95b895779dcef2149d862602636f44ba77

Comment by Githook User [ 09/Apr/15 ]

Author: Hannes Magnusson (bjori) <bjori@php.net>

Message: Workaround CDRIVER-610
Branch: master
https://github.com/10gen-labs/mongo-php-driver-prototype/commit/73a510d9463c2aaa79a7b768dfe1538431ab0efc

Comment by David Golden [ 08/Apr/15 ]

The requirement of the server-selection spec is that the driver not give up until the serverSelectionTimeoutMS has elapsed, regardless of the cause. It presumes a functional topology as described by SDAM. Finding out what's happening when the program is running seems like a logging issue.

How a driver validates the seed list from the connection string is not covered by Server Selection or SDAM. (Probably that should be part of a connection string spec.) I think that drivers are free to check that hostnames resolve or that unix domain socket paths exist when processing the connection string before starting a topology manager.

Another option could be to provide users with a method to do a scan and inspect the topology that they can run at the start of an application.

Comment by Mira Carey [ 08/Apr/15 ]

bjori, Honestly, I'm not sure how else to interpret the spec.

  • We'd like to attempt an operation:
  • We enter server selection
  • No server is primary, so we start scanning
  • We haven't actually talked to anyone, so we try to connect and fail
  • The client still wants an answer, so we scan every 1/2 a second
  • We scan until the server selection timeout is up
  • Failure in selection

If the point of server selection timeout is to wait for 30 seconds so a primary can be elected, I'm not sure what else we're actually allowed to do there...

I do agree that there should be some indication of what's going on. How about I spam a bunch of warnings in the logs on every connect failure?

Comment by Bernie Hackett [ 08/Apr/15 ]

The SS/SDAM specs sound like a red herring here. Is this just a C driver bug? The driver going into a tight loop and using 100% CPU definitely isn't a requirement of the spec.

Comment by Hannes Magnusson [ 08/Apr/15 ]

mira.carey@mongodb.com If the spec says "immediately retry for 30 seconds" then I suppose this works as spec'd and my issue is with the spec and not mongoc.
This doesn't feel very user-friendly - the user may never see any error message and will never figure out that they typo'ed the hostname, while the process is taking 100% CPU and needs to be killed.

I'd appreciate a comment from the authors of these specs jesse david.golden craiggwilson declaring this behavior as kosher.

The cliffnotes:
Any stream failure will result in 100% cpu usage and a perceived application freeze for `30 seconds * round-trip actions` - (which easily results in minutes)

  • A SSL verification failure
  • Unresolvable hostname
  • /invalid/path/to/UDS.socket
  • No mongod running on port
  • Wrong username:password

The only way out of the situation is kill -9 on the process - so the user will never see a failure message or be able to figure out why the application "froze".

Comment by Mira Carey [ 08/Apr/15 ]

behackett It's 1 minute of real time between start and end. I.e. clock time. If the user time or sys time are higher, I'm happy to take another look

bjori That sounds correct. By default the server selection timeout is 30 seconds. If you're doing two operations, it'll take 30 seconds to time out the first one, then 30 seconds for the second.

Looks like this all works as spec'd now?

Comment by Bernie Hackett [ 08/Apr/15 ]

bjori@mongodb.com, you say the driver goes into a tight loop and takes 100% CPU? If so, that must be a bug.

Comment by Hannes Magnusson [ 08/Apr/15 ]

Down to 1 minute after latest commit:

$ time ./uds 
Insert failed: Timed out trying to select a server
Delete failed: Timed out trying to select a server
 
real	1m0.011s

Comment by Githook User [ 08/Apr/15 ]

Author: Jason Carey (hanumantmk) <jcarey@argv.me>

Message: CDRIVER-610 remove retries in server selection

There's no need to perform retries in server selection. The topology
select takes care of requesting a scan and waiting for the allotted
timeout.
Branch: 1.2.0-dev
https://github.com/mongodb/mongo-c-driver/commit/21f69173adf793bc60e8dac8197fd5ddbc18c9cb

Comment by A. Jesse Jiryu Davis [ 08/Apr/15 ]

It's not endless, but it is too long. At least one bug is that MAX_RETRY_COUNT=3 is still in effect; retries were removed in the server selection spec and should be removed from the driver:

https://github.com/mongodb/specifications/blob/master/source/server-selection/server-selection.rst#what-happened-to-auto-retry
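
The timings reported in this ticket are consistent with each attempt waiting out the full default 30-second timeout. The worked check below assumes that model; `total_wait_seconds` is a hypothetical helper written only for this arithmetic, and the attempt counts are an interpretation of MAX_RETRY_COUNT=3, not something read from the driver source.

```c
#include <assert.h>

/* Total wall-clock wait if every attempt of every operation times out. */
static int
total_wait_seconds (int operations, int attempts_per_operation, int timeout_seconds)
{
   assert (operations >= 0 && attempts_per_operation >= 0 && timeout_seconds >= 0);
   return operations * attempts_per_operation * timeout_seconds;
}
```

For the two operations in the report (insert and delete), 2 * 3 * 30 gives 180 seconds, matching the observed 3m0.011s, and with a single attempt per operation 2 * 1 * 30 gives 60 seconds, matching the 1m0.011s seen after the retries were removed.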

Comment by Hannes Magnusson [ 08/Apr/15 ]

Likely related to CDRIVER-594
