The C Driver's topology scanner enforces a hard limit of connectTimeoutMS on the entire call to mongoc_async_run, measured from when the call begins discovery until it returns. Some server checks begin partway through that call, because we start checking a server newly discovered in the host list of an isMaster response immediately, but we don't reset the timer when we do so. A server discovered late therefore gets less than connectTimeoutMS to respond.
This ignores connectTimeoutMS's purpose: it is not meant to limit the duration of mongoc_async_run, but to distinguish a server that is slow to respond from a server that is down.
Track the duration of server checks per server, instead of for the entire event loop. Let the loop run longer than connectTimeoutMS when needed, so that each newly discovered server gets a fair timeout measured from the start of its own check.
Note that mongoc_async_run's original design tried to do this. Then it went through an awkward phase in which it enforced timeouts both per server and for the entire function call. Then I rewrote it to enforce the timeout only for the whole function. That was a mistake: we should instead get rid of the whole-function timeout and track timeouts only per server.
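The per-server tracking proposed above could look roughly like the sketch below. It is not the driver's code: server_check_t, check_begin, checks_expire, and checks_next_wait_ms are hypothetical names invented for illustration, and time is passed in as plain millisecond values rather than read from a clock. The idea is that each check records its own start time, only checks that have individually exceeded the timeout are expired, and the loop sleeps until the earliest remaining per-server deadline.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-server check state; not libmongoc's actual struct. */
typedef struct {
   int64_t start_ms;  /* when this server's check began */
   bool done;         /* check finished, by reply, error, or timeout */
   bool timed_out;    /* true if this check exceeded the timeout */
} server_check_t;

/* Start a check. The timer starts now, even if the server was
 * discovered in the middle of the event loop. */
static void
check_begin (server_check_t *c, int64_t now_ms)
{
   c->start_ms = now_ms;
   c->done = false;
   c->timed_out = false;
}

/* Expire only the checks that have individually run for timeout_ms.
 * Returns the number of checks still pending. */
static size_t
checks_expire (server_check_t *checks, size_t n, int64_t now_ms,
               int64_t timeout_ms)
{
   size_t pending = 0;

   for (size_t i = 0; i < n; i++) {
      if (checks[i].done) {
         continue;
      }
      if (now_ms - checks[i].start_ms >= timeout_ms) {
         checks[i].done = true;
         checks[i].timed_out = true;
      } else {
         pending++;
      }
   }
   return pending;
}

/* How long the loop may block before the earliest per-server deadline.
 * Returns -1 when no check is pending, meaning the loop can exit. */
static int64_t
checks_next_wait_ms (const server_check_t *checks, size_t n, int64_t now_ms,
                     int64_t timeout_ms)
{
   int64_t wait = -1;

   for (size_t i = 0; i < n; i++) {
      if (checks[i].done) {
         continue;
      }
      int64_t remaining = checks[i].start_ms + timeout_ms - now_ms;
      if (remaining < 0) {
         remaining = 0;
      }
      if (wait < 0 || remaining < wait) {
         wait = remaining;
      }
   }
   return wait;
}
```

With a 10000 ms timeout, a server whose check starts at 0 ms expires at 10000 ms, while a server discovered at 4000 ms stays pending until 14000 ms. The loop as a whole runs longer than connectTimeoutMS, but no single server waits longer than connectTimeoutMS.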
Related to: PHPC-723 Blocking connect() leads to cumulative timeouts for multiple inaccessible servers (Closed)