[CDRIVER-1571] Spurious topology-scanner timeouts when using stream initiator Created: 30/Sep/16  Updated: 20/Oct/16  Resolved: 05/Oct/16

Status: Closed
Project: C Driver
Component/s: libmongoc, network
Affects Version/s: 1.2.0
Fix Version/s: 1.3.6, 1.5.0

Type: Bug Priority: Major - P3
Reporter: A. Jesse Jiryu Davis Assignee: A. Jesse Jiryu Davis
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by CDRIVER-1682 Backport CDRIVER-1571 to 1.3 Closed
Related
related to PHPC-820 Upgrade libmongoc to 1.3.6 Closed

 Description   

Async "ismaster" commands started before a slow call to a blocking stream initiator have the initiator's duration subtracted from their timeouts. The initiator and the async commands use the same initial timeout value, so once an initiator times out, all async commands started beforehand now have 0 seconds remaining. They're canceled before they run.

Diagnosed with the PHP Driver 1.1.8 and C Driver 1.3.5. The new PHP Driver will no longer use custom stream initiators. Another library could theoretically use them and hit this bug.

Scenario: create a single client with URI "mongodb://host1,host2/?replicaSet=rs". host1 is up, host2 is unresponsive.

With mongoc_client_set_stream_initiator set a custom stream initiator that does a blocking initial connect. Begin an application operation to trigger server selection and a topology scan.

In mongoc_topology_scanner_start:

  1. Create a stream for host1
  2. Call mongoc_async_cmd_new at Time 0
  3. mongoc_async_cmd_new sets expiration for calling ismaster on host1 to Time 0 + 10 seconds
  4. Create a stream for host2 - this blocks for 10 seconds and fails
  5. The time is now after Time 0 + 10 seconds

Proceed to _mongoc_topology_run_scanner:

  1. The async command for host1 is already expired, mark it failed
  2. We never created an async command for host2 since its initiator timed out

The C Driver's error is like:

No suitable servers found (`serverselectiontryonce` set): [connection timeout calling ismaster on 'host1:27017'] [Failed connecting to 'host2:27017': Connection timed out]

The part of the error message about host1 comes from the topology scanner, it's created when the async command is canceled. The part about host2 comes from the custom stream initiator, in this case I've copied the error message formatted by the PHP Driver's custom stream initiator.

In the particular scenario that led to me diagnosing this bug, the C Driver also suffered from CDRIVER-1567 so the first part of the message wrongly said "connection error", but it should have said "connection timeout".

The solution is to stop tracking per-command timeouts; they're intended for a future full-async driver that we've decided not to implement. We already track a timeout for _mongoc_topology_run_scanner, which does what we need now.



 Comments   
Comment by Githook User [ 12/Oct/16 ]

Author:

{u'username': u'ajdavis', u'name': u'A. Jesse Jiryu Davis', u'email': u'jesse@mongodb.com'}

Message: CDRIVER-1571 simplify async command loop

mongoc_async_run() no longer returns "true" if there is more work to do:
it never returns early now since there is only one timeout for the whole
function. Also, make timeouts int64_t in all internal functions.
Branch: r1.3
https://github.com/mongodb/mongo-c-driver/commit/f39b4776635c152a93a2d943a23480577f71fb66

Comment by Githook User [ 12/Oct/16 ]

Author:

{u'username': u'bjori', u'name': u'Hannes Magnusson', u'email': u'bjori@php.net'}

Message: CDRIVER-1571 Add missing r1.3 required includes
Branch: r1.3
https://github.com/mongodb/mongo-c-driver/commit/ff1ce0ad16d628bf26f01d5b630a3f4533504faf

Comment by Githook User [ 12/Oct/16 ]

Author:

{u'username': u'ajdavis', u'name': u'A. Jesse Jiryu Davis', u'email': u'jesse@mongodb.com'}

Message: CDRIVER-1571 slow initiator cancels topology scan

Async ismaster commands started before a slow call to a blocking stream
initiator had the initiator's duration subtracted from their timeouts.
The initiator and the async commands use the same initial timeout value,
so once an initiator times out, all async commands started beforehand
now have 0 seconds remaining. They're canceled before they run.

This change relies on the overall timeout applied to mongoc_async_run
instead of per-command timeouts.
Branch: r1.3
https://github.com/mongodb/mongo-c-driver/commit/08e4cc9bf92cee2199fad2beeb4c7fd5659ae7ae

Comment by Githook User [ 05/Oct/16 ]

Author:

{u'username': u'ajdavis', u'name': u'A. Jesse Jiryu Davis', u'email': u'jesse@mongodb.com'}

Message: CDRIVER-1571 simplify async command loop

mongoc_async_run() no longer returns "true" if there is more work to do:
it never returns early now since there is only one timeout for the whole
function. Also, make timeouts int64_t in all internal functions.
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/5cf7749b33d7d976e5b47979f4fc30a9c2594c71

Comment by Jeremy Mikola [ 04/Oct/16 ]

jesse: Last week, another user reported an issue with what appeared to be unexpected timeouts (incorrectly reported as "connection errors") on available nodes when a passive node (not in the URI seed list) was unavailable (see: mongodb/mongo-php-driver#425). In this case, the user appeared to be trying to select a secondary server ("secondaryPreferred" read preference); however, I expect at least one node in the seed list would need to have responded to an isMaster command in order for the topology to discover the passive node.

Do you suppose this is related? At the very least, it may make for another test case to add for the fix.

Comment by Githook User [ 04/Oct/16 ]

Author:

{u'username': u'ajdavis', u'name': u'A. Jesse Jiryu Davis', u'email': u'jesse@mongodb.com'}

Message: CDRIVER-1571 slow initiator cancels topology scan

Async ismaster commands started before a slow call to a blocking stream
initiator had the initiator's duration subtracted from their timeouts.
The initiator and the async commands use the same initial timeout value,
so once an initiator times out, all async commands started beforehand
now have 0 seconds remaining. They're canceled before they run.

This change relies on the overall timeout applied to mongoc_async_run
instead of per-command timeouts.
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/a5a02711aab0e8d01e16faf7b0d63ca5219292d6

Comment by A. Jesse Jiryu Davis [ 30/Sep/16 ]

Yeah if you read the "scenario" above it shows why order matters. Async commands' expiration times are calculated based on when the commands are created.

Comment by Jeremy Mikola [ 30/Sep/16 ]

jesse: In my test to demonstrate the bug, I was using a one second connection timeout. Am I correct in assuming that the async commands are using the same timeout and that is one one failed, blocking connect attempt is enough to cause each async command to prematurely fail?

Were you able to determine why the order of nodes in the seed list was significant? That is, async commands only fail for hosts listed before the unavailable host.

Generated at Wed Feb 07 21:12:55 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.