[CDRIVER-631] Timeout for reconnects is longer than would be expected Created: 01/May/15  Updated: 19/Oct/16  Resolved: 21/May/15

Status: Closed
Project: C Driver
Component/s: None
Affects Version/s: 1.1.4
Fix Version/s: 1.1.6, 1.1.7

Type: Bug Priority: Major - P3
Reporter: Mira Carey Assignee: A. Jesse Jiryu Davis
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends

 Description   

It can take a surprising amount of time before we report primary not available when connecting to a replica set with two downed members.

In particular, we have to try:

(1(initial connect) + 3(tries)) * 2(non-responsive members in seed list) * 2(non-responsive members discovered from live member) * N(connecttimeoutms) = a maximum of 16 * N seconds before returning.

With a timeout of 3 seconds:

2015/05/01 14:39:14.0873: [15700]:    DEBUG:      cluster: Client initialized in replica set mode.
2015/05/01 14:39:14.0873: [15700]:    DEBUG:      cluster: Reconnecting to replica set.
2015/05/01 14:39:17.0876: [15700]:  WARNING:      cluster: Failed connection to localhost:30001
2015/05/01 14:39:20.0879: [15700]:  WARNING:      cluster: Failed connection to localhost:30002
2015/05/01 14:39:20.0883: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30003
2015/05/01 14:39:20.0883: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30002
2015/05/01 14:39:20.0883: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30001
2015/05/01 14:39:23.0889: [15700]:  WARNING:      cluster: Failed connection to localhost:30001
2015/05/01 14:39:26.0894: [15700]:  WARNING:      cluster: Failed connection to localhost:30002
2015/05/01 14:39:26.0901: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30003
2015/05/01 14:39:26.0901: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30002
2015/05/01 14:39:26.0901: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30001
2015/05/01 14:39:26.0902: [15700]:    DEBUG:      cluster: Reconnecting to replica set.
2015/05/01 14:39:29.0904: [15700]:  WARNING:      cluster: Failed connection to localhost:30001
2015/05/01 14:39:32.0907: [15700]:  WARNING:      cluster: Failed connection to localhost:30002
2015/05/01 14:39:32.0911: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30003
2015/05/01 14:39:32.0911: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30002
2015/05/01 14:39:32.0911: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30001
2015/05/01 14:39:35.0916: [15700]:  WARNING:      cluster: Failed connection to localhost:30001
2015/05/01 14:39:38.0922: [15700]:  WARNING:      cluster: Failed connection to localhost:30002
2015/05/01 14:39:38.0928: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30003
2015/05/01 14:39:38.0928: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30002
2015/05/01 14:39:38.0928: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30001
2015/05/01 14:39:38.0929: [15700]:    DEBUG:      cluster: Reconnecting to replica set.
2015/05/01 14:39:41.0932: [15700]:  WARNING:      cluster: Failed connection to localhost:30001
2015/05/01 14:39:44.0935: [15700]:  WARNING:      cluster: Failed connection to localhost:30002
2015/05/01 14:39:44.0938: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30003
2015/05/01 14:39:44.0938: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30002
2015/05/01 14:39:44.0938: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30001
2015/05/01 14:39:47.0944: [15700]:  WARNING:      cluster: Failed connection to localhost:30001
2015/05/01 14:39:50.0949: [15700]:  WARNING:      cluster: Failed connection to localhost:30002
2015/05/01 14:39:50.0956: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30003
2015/05/01 14:39:50.0956: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30002
2015/05/01 14:39:50.0956: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30001
2015/05/01 14:39:50.0956: [15700]:    DEBUG:      cluster: Reconnecting to replica set.
2015/05/01 14:39:53.0959: [15700]:  WARNING:      cluster: Failed connection to localhost:30001
2015/05/01 14:39:56.0962: [15700]:  WARNING:      cluster: Failed connection to localhost:30002
2015/05/01 14:39:56.0966: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30003
2015/05/01 14:39:56.0966: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30002
2015/05/01 14:39:56.0966: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30001
2015/05/01 14:39:59.0971: [15700]:  WARNING:      cluster: Failed connection to localhost:30001
2015/05/01 14:40:02.0975: [15700]:  WARNING:      cluster: Failed connection to localhost:30002
2015/05/01 14:40:02.0982: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30003
2015/05/01 14:40:02.0982: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30002
2015/05/01 14:40:02.0982: [15700]:    DEBUG:      cluster: Registering potential peer: localhost:30001
Cursor Failure: Requested PRIMARY node is not available.



 Comments   
Comment by A. Jesse Jiryu Davis [ 21/May/15 ]

https://github.com/mongodb/mongo-c-driver/commit/d4a56e7e6

Comment by Githook User [ 18/May/15 ]

Author:

{u'username': u'ajdavis', u'name': u'A. Jesse Jiryu Davis', u'email': u'jesse@mongodb.com'}

Message: CDRIVER-631 reduce max retries from 3 to 1

It can take a surprising amount of time before we report primary not available
when connecting to a replica set with two downed members. With
socketTimeoutMS = connectionTimeoutMS = 1 second, it can take up to 16 seconds
to return an error from an operation.

Reducing retries to 1 speeds this scenario to bail after 8 seconds, not 16.
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/0e5a5a6bf315ab2a11a3d3915e4ddf948e91873b

Comment by A. Jesse Jiryu Davis [ 04/May/15 ]

I'll verify that this works as designed now that the C Driver implements the Server Discovery And Monitoring Spec in 1.2.0. Certainly the three retries have been removed, and serverSelectionTimeoutMS is now configurable separately from connectTimeoutMS, so a client application can decide exactly how long it is willing to block awaiting reconnection.

This bug can't be fixed in the 1.1.x line without breaking compatibility, and without an unacceptable amount of rewriting.

Generated at Wed Feb 07 21:10:07 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.