[CDRIVER-610] Client should retry every .5 sec until server selection timeout Created: 08/Apr/15 Updated: 08/Jan/24 Resolved: 28/Apr/15 |
|
| Status: | Closed |
| Project: | C Driver |
| Component/s: | None |
| Affects Version/s: | 1.2.0 |
| Fix Version/s: | 1.2-beta0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Hannes Magnusson | Assignee: | A. Jesse Jiryu Davis |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Description |
|
*Summary (Jesse)*: These were two bugs in the single-threaded implementation of the Server Discovery And Monitoring and Server Selection specs. First, after an initial connection failed, the client never re-attempted the connection. Second, it spun in a tight loop until the server selection timeout (default 30 seconds) expired. The fix is to actually re-attempt a connection after each connection failure (until the timeout), and to pause half a second between attempts (the minHeartbeatFrequencyMS in Server Discovery And Monitoring).

*Original Report (Hannes)*: Straight from the example docs: changing the port to one where no mongod is listening results in an endless loop.
mongoc 1.2.x:
mongoc 1.1.x produced the correct result:
|
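The fix described in the summary above (re-attempt after each connection failure until the server selection timeout, pausing half a second between attempts) can be sketched roughly as below. This is only an illustration under stated assumptions, not the driver's actual code: `sim_t`, `sim_try_connect`, `sim_sleep` and `select_server` are hypothetical names, and the clock and server are simulated so the loop can be run without a network.

```c
#include <stdbool.h>
#include <stdint.h>

/* All names below are hypothetical; this is not the libmongoc API. */
#define MIN_HEARTBEAT_FREQUENCY_MS 500    /* pause between attempts */
#define SERVER_SELECTION_TIMEOUT_MS 30000 /* default timeout (30 seconds) */

/* A simulated clock and server, so the loop can be exercised without a
 * network connection or real sleeping. */
typedef struct {
   int64_t now_ms;         /* simulated monotonic clock */
   int attempts;           /* connection attempts made so far */
   int succeed_on_attempt; /* 0 means the server never comes up */
} sim_t;

/* One connection attempt; succeeds once the configured attempt is reached. */
static bool
sim_try_connect (sim_t *sim)
{
   sim->attempts++;
   return sim->succeed_on_attempt > 0 &&
          sim->attempts >= sim->succeed_on_attempt;
}

/* Advance the simulated clock; a real client would block here
 * instead of spinning. */
static void
sim_sleep (sim_t *sim, int64_t ms)
{
   sim->now_ms += ms;
}

/* Retry the connection until it succeeds or the timeout elapses,
 * waiting at most MIN_HEARTBEAT_FREQUENCY_MS between attempts. */
static bool
select_server (sim_t *sim, int64_t timeout_ms)
{
   const int64_t deadline = sim->now_ms + timeout_ms;

   for (;;) {
      int64_t remaining;

      if (sim_try_connect (sim)) {
         return true;
      }
      remaining = deadline - sim->now_ms;
      if (remaining <= 0) {
         return false; /* timeout elapsed: report the failure */
      }
      sim_sleep (sim,
                 remaining < MIN_HEARTBEAT_FREQUENCY_MS
                    ? remaining
                    : MIN_HEARTBEAT_FREQUENCY_MS);
   }
}
```

In this sketch a server that never comes up costs 61 attempts over 30 simulated seconds before `select_server` returns false, rather than an unbounded number of attempts in a tight loop. A server that comes up on the third attempt is selected after 1 simulated second.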
| Comments |
| Comment by Githook User [ 07/Oct/15 ] | |||||
|
Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>
Message: Server Selection Spec: Single-threaded server selection: When a client that uses single-threaded https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#single-threaded-server-selection | |||||
| Comment by Githook User [ 07/Oct/15 ] | |||||
|
Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>
Message: While looping to select a server, attempt reconnect after error. | |||||
| Comment by Githook User [ 07/Oct/15 ] | |||||
|
Author: Jason Carey (hanumantmk) <jcarey@argv.me>
Message: There's no need to perform retries in server selection. The topology | |||||
| Comment by Githook User [ 28/Apr/15 ] | |||||
|
Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>
Message: Server Selection Spec: Single-threaded server selection: When a client that uses single-threaded https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#single-threaded-server-selection | |||||
| Comment by Githook User [ 28/Apr/15 ] | |||||
|
Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>
Message: While looping to select a server, attempt reconnect after error. | |||||
| Comment by Githook User [ 09/Apr/15 ] | |||||
|
Author: Hannes Magnusson (bjori) <bjori@php.net>
Message: Workaround | |||||
| Comment by David Golden [ 08/Apr/15 ] | |||||
|
The requirement of the server-selection spec is that the driver not give up until the serverSelectionTimeoutMS has elapsed, regardless of the cause. It presumes a functional topology as described by SDAM. Finding out what's happening when the program is running seems like a logging issue. How a driver validates the seed list from the connection string is not covered by Server Selection or SDAM. (Probably that should be part of a connection string spec.) I think that drivers are free to check that hostnames resolve or that unix domain socket paths exist when processing the connection string before starting a topology manager. Another option could be to provide users with a method to do a scan and inspect the topology that they can run at the start of an application. | |||||
| Comment by Mira Carey [ 08/Apr/15 ] | |||||
|
bjori, Honestly, I'm not sure how else to interpret the spec.
If the point of server selection timeout is to wait for 30 seconds so a primary can be elected, I'm not sure what else we're actually allowed to do there... I do agree that there should be some indication of what's going on. How about I spam a bunch of warnings in the logs on every connect failure? | |||||
| Comment by Bernie Hackett [ 08/Apr/15 ] | |||||
|
The SS/SDAM specs sound like a red herring here. Is this just a C driver bug? The driver going into a tight loop and using 100% CPU definitely isn't a requirement of the spec. | |||||
| Comment by Hannes Magnusson [ 08/Apr/15 ] | |||||
|
mira.carey@mongodb.com If the spec says "immediately retry for 30 seconds" then I suppose this works as spec'd and my issue is with the spec and not mongoc. I'd appreciate a comment from the authors of these specs jesse david.golden craiggwilson declaring this behavior as kosher. The cliffnotes:
The only way out of the situation is kill -9 on the process - so the user will never see a failure message or be able to figure out why the application "froze". | |||||
| Comment by Mira Carey [ 08/Apr/15 ] | |||||
|
behackett It's 1 minute of real time between start and end. I.e. clock time. If the user time or sys time are higher, I'm happy to take another look.

bjori That sounds correct. By default the server selection timeout is 30 seconds. If you're doing two operations, it'll take 30 seconds to time out the first one, then 30 seconds for the second. Looks like this all works as spec'd now? | |||||
| Comment by Bernie Hackett [ 08/Apr/15 ] | |||||
|
bjori@mongodb.com, you say the driver goes into a tight loop and takes 100% CPU? If so, that must be a bug. | |||||
| Comment by Hannes Magnusson [ 08/Apr/15 ] | |||||
|
Down to 1 minute after latest commit:
| |||||
| Comment by Githook User [ 08/Apr/15 ] | |||||
|
Author: Jason Carey (hanumantmk) <jcarey@argv.me>
Message: There's no need to perform retries in server selection. The topology | |||||
| Comment by A. Jesse Jiryu Davis [ 08/Apr/15 ] | |||||
|
It's not endless, but it is too long. At least one bug is that MAX_RETRY_COUNT=3 is still in effect; retries were removed in the server selection spec and should be removed from the driver: | |||||
| Comment by Hannes Magnusson [ 08/Apr/15 ] | |||||
|
Likely related to |