[CDRIVER-1571] Spurious topology-scanner timeouts when using stream initiator Created: 30/Sep/16 Updated: 20/Oct/16 Resolved: 05/Oct/16 |
|
| Status: | Closed |
| Project: | C Driver |
| Component/s: | libmongoc, network |
| Affects Version/s: | 1.2.0 |
| Fix Version/s: | 1.3.6, 1.5.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | A. Jesse Jiryu Davis | Assignee: | A. Jesse Jiryu Davis |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Description |
|
Async "ismaster" commands started before a slow call to a blocking stream initiator have the initiator's duration subtracted from their timeouts. The initiator and the async commands use the same initial timeout value, so once an initiator times out, all async commands started beforehand now have 0 seconds remaining. They're canceled before they run. Diagnosed with the PHP Driver 1.1.8 and C Driver 1.3.5. The new PHP Driver will no longer use custom stream initiators. Another library could theoretically use them and hit this bug. Scenario: create a single client with URI "mongodb://host1,host2/?replicaSet=rs". host1 is up, host2 is unresponsive. With mongoc_client_set_stream_initiator set a custom stream initiator that does a blocking initial connect. Begin an application operation to trigger server selection and a topology scan. In mongoc_topology_scanner_start:
Proceed to _mongoc_topology_run_scanner:
The C Driver's error is like:
The part of the error message about host1 comes from the topology scanner, it's created when the async command is canceled. The part about host2 comes from the custom stream initiator, in this case I've copied the error message formatted by the PHP Driver's custom stream initiator. In the particular scenario that led to me diagnosing this bug, the C Driver also suffered from The solution is to stop tracking per-command timeouts; they're intended for a future full-async driver that we've decided not to implement. We already track a timeout for _mongoc_topology_run_scanner, which does what we need now. |
| Comments |
| Comment by Githook User [ 12/Oct/16 ] |
|
Author: {u'username': u'ajdavis', u'name': u'A. Jesse Jiryu Davis', u'email': u'jesse@mongodb.com'}Message: mongoc_async_run() no longer returns "true" if there is more work to do: |
| Comment by Githook User [ 12/Oct/16 ] |
|
Author: {u'username': u'bjori', u'name': u'Hannes Magnusson', u'email': u'bjori@php.net'}Message: |
| Comment by Githook User [ 12/Oct/16 ] |
|
Author: {u'username': u'ajdavis', u'name': u'A. Jesse Jiryu Davis', u'email': u'jesse@mongodb.com'}Message: Async ismaster commands started before a slow call to a blocking stream This change relies on the overall timeout applied to mongoc_async_run |
| Comment by Githook User [ 05/Oct/16 ] |
|
Author: {u'username': u'ajdavis', u'name': u'A. Jesse Jiryu Davis', u'email': u'jesse@mongodb.com'}Message: mongoc_async_run() no longer returns "true" if there is more work to do: |
| Comment by Jeremy Mikola [ 04/Oct/16 ] |
|
jesse: Last week, another user reported an issue with what appeared to be unexpected timeouts (incorrectly reported as "connection errors") on available nodes when a passive node (not in the URI seed list) was unavailable (see: mongodb/mongo-php-driver#425). In this case, the user appeared to be trying to select a secondary server ("secondaryPreferred" read preference); however, I expect at least one node in the seed list would need to have responded to an isMaster command in order for the topology to discover the passive node. Do you suppose this is related? At the very least, it may make for another test case to add for the fix. |
| Comment by Githook User [ 04/Oct/16 ] |
|
Author: {u'username': u'ajdavis', u'name': u'A. Jesse Jiryu Davis', u'email': u'jesse@mongodb.com'}Message: Async ismaster commands started before a slow call to a blocking stream This change relies on the overall timeout applied to mongoc_async_run |
| Comment by A. Jesse Jiryu Davis [ 30/Sep/16 ] |
|
Yeah if you read the "scenario" above it shows why order matters. Async commands' expiration times are calculated based on when the commands are created. |
| Comment by Jeremy Mikola [ 30/Sep/16 ] |
|
jesse: In my test to demonstrate the bug, I was using a one second connection timeout. Am I correct in assuming that the async commands are using the same timeout and that is one one failed, blocking connect attempt is enough to cause each async command to prematurely fail? Were you able to determine why the order of nodes in the seed list was significant? That is, async commands only fail for hosts listed before the unavailable host. |