Type: Bug
Resolution: Unresolved
Priority: Unknown
Affects Version/s: None
Component/s: None
Summary
A command executed with a timeout may run longer than that timeout, depending on network conditions and server behavior.
This is an obstacle to CSOT (Client Side Operations Timeout): at this lowest level, operation timeouts cannot be fully enforced.
Environment
This issue was introduced with the fix to CDRIVER-1571.
Additional Background
The fix implemented for CDRIVER-1571 resets the timeout associated with async command objects at certain points in their execution. Because each reset grants the command a fresh timeout, an operation can take longer to expire than the user requested, so user timeouts are not honored. If the timeout is reset N times, a command may take up to about (N + 1) times its requested timeout to finish or fail.
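The effect can be illustrated with a small simulation. This is a hypothetical model, not libmongoc code: the function names, the notion of "phases", and the millisecond durations are all assumptions made for illustration. It compares a timer that restarts at every phase with a single deadline fixed when the command starts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model (not libmongoc code): a command is a sequence of
 * phases (e.g. connect, handshake, write, read), each with a simulated
 * duration in milliseconds. */

/* Policy A: the timer restarts at the start of every phase.
 * Returns true if the command completes without timing out;
 * *elapsed_ms receives the total time the caller waited. */
static bool
run_with_timer_reset (const int64_t *phase_ms, size_t n_phases, int64_t timeout_ms, int64_t *elapsed_ms)
{
   assert (timeout_ms > 0);
   int64_t now = 0;
   for (size_t i = 0; i < n_phases; i++) {
      const int64_t expire_at = now + timeout_ms; /* timer reset here */
      if (now + phase_ms[i] > expire_at) {
         *elapsed_ms = expire_at;
         return false;
      }
      now += phase_ms[i];
   }
   *elapsed_ms = now;
   return true;
}

/* Policy B: one absolute deadline, computed once when the command starts. */
static bool
run_with_deadline (const int64_t *phase_ms, size_t n_phases, int64_t timeout_ms, int64_t *elapsed_ms)
{
   assert (timeout_ms > 0);
   int64_t now = 0;
   const int64_t expire_at = timeout_ms; /* never moved afterwards */
   for (size_t i = 0; i < n_phases; i++) {
      if (now + phase_ms[i] > expire_at) {
         *elapsed_ms = expire_at;
         return false;
      }
      now += phase_ms[i];
   }
   *elapsed_ms = now;
   return true;
}
```

With three phases of 900 ms each and a 1000 ms timeout, the reset policy reports success after 2700 ms, while the single-deadline policy fails at 1000 ms as the user requested.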
The fix was needed because stream connection and stream setup can be customized by users and may be blocking operations, while the async command runner has only a single thread. Any blocking operation in any step therefore blocks every operation in the async pool: if a single step takes time T to complete, every other command in the pool sees a delay of T in whatever step it is currently paused in. To work around this, the async runner was modified to restart the timer on operations at certain points in their async loop.
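A rough sketch of why one blocking step penalizes every command on a single thread. The `sim_cmd_t` type and `sim_run_one_pass` function are hypothetical and exist only for this illustration; they do not correspond to the driver's internal structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one command in a single-threaded async pool. */
typedef struct {
   int64_t step_ms;   /* how long this command's next step holds the thread */
   int64_t waited_ms; /* wall-clock time this command has observed so far */
   bool timed_out;
} sim_cmd_t;

/* One pass of the loop: steps run one after another on the same thread,
 * so every command's clock advances by the duration of every step,
 * including steps that belong to other commands. */
static void
sim_run_one_pass (sim_cmd_t *cmds, size_t n, int64_t timeout_ms)
{
   for (size_t i = 0; i < n; i++) {
      for (size_t j = 0; j < n; j++) {
         cmds[j].waited_ms += cmds[i].step_ms;
      }
   }
   for (size_t j = 0; j < n; j++) {
      cmds[j].timed_out = cmds[j].waited_ms > timeout_ms;
   }
}
```

For example, with two commands whose steps take 1500 ms (a blocking user callback) and 1 ms, and a 1000 ms timeout, both commands end up timed out, even though the second one needed only 1 ms of its own.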
Additionally, this fix is potentially insufficient: it does not reset the timeout often enough to prevent the problem in all cases. For example, the timeout is not reset between chunked reads/writes, so a blocking operation on another stream can still cause an unrelated read/write to time out. Adding more timer resets would only make the problem described above worse.
In general, performing even a single blocking operation of unbounded duration within an async event loop stalls the entire event loop for every operation in it.
The proper fix is to support (and require) that user-provided I/O functions be non-blocking. Supporting this better will require the async command runner to make it easy for users to provide non-blocking stream callbacks. The runner does seem to support non-blocking operations, but does not mandate them. It may be possible to detect and warn about blocking operations when a user callback takes longer than is reasonable (e.g. 250ms?).
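One possible shape for the detect-and-warn idea is sketched below. Everything here is an assumption for illustration: `call_and_warn_if_slow`, the callback signature, and the injectable clock are hypothetical names, not existing libmongoc APIs. The sketch uses POSIX `clock_gettime` with `CLOCK_MONOTONIC` as the default clock, and includes a simulated clock and callback so the behavior can be exercised without sleeping.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

typedef int64_t (*clock_ms_fn) (void); /* hypothetical clock hook */
typedef void (*user_cb_fn) (void *ctx); /* hypothetical user callback */

/* Default clock: monotonic time in milliseconds (POSIX). */
static int64_t
monotonic_ms (void)
{
   struct timespec ts;
   clock_gettime (CLOCK_MONOTONIC, &ts);
   return (int64_t) ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Runs the user callback and warns on stderr if it held the thread for
 * longer than threshold_ms. Returns true if a warning was emitted.
 * Passing NULL for now_ms selects the monotonic clock. */
static bool
call_and_warn_if_slow (user_cb_fn cb, void *ctx, int64_t threshold_ms, clock_ms_fn now_ms)
{
   if (!now_ms) {
      now_ms = monotonic_ms;
   }
   const int64_t start = now_ms ();
   cb (ctx);
   const int64_t elapsed = now_ms () - start;
   if (elapsed > threshold_ms) {
      fprintf (stderr,
               "warning: stream callback blocked for %lld ms (limit %lld ms)\n",
               (long long) elapsed,
               (long long) threshold_ms);
      return true;
   }
   return false;
}

/* Demonstration helpers: a simulated clock that the fake callback
 * advances instead of actually blocking. */
static int64_t fake_now = 0;

static int64_t
fake_clock_ms (void)
{
   return fake_now;
}

static void
fake_blocking_cb (void *ctx)
{
   fake_now += *(int64_t *) ctx; /* pretend to block for *ctx ms */
}
```

With the simulated clock, a callback that "blocks" for 400 ms against a 250 ms threshold triggers the warning, while one that takes 10 ms does not. A check like this can only report a blocking callback after the fact; it does not prevent the stall.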
Related to: CDRIVER-1571 Spurious topology-scanner timeouts when using stream initiator (Closed)