[DRIVERS-2246] Heartbeat build up with streaming protocol when driver process is stopped (FAAS) Created: 25/Mar/22  Updated: 24/Mar/23  Resolved: 24/Mar/23

Status: Closed
Project: Drivers
Component/s: FaaS, SDAM
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Neal Beeken Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
duplicates DRIVERS-2578 Switch to polling monitoring when run... Implementing
Issue split
split to CDRIVER-4492 Heartbeat build up with streaming pro... Closed
split to CSHARP-4352 Heartbeat build up with streaming pro... Closed
split to CXX-2593 Heartbeat build up with streaming pro... Closed
split to GODRIVER-2577 Heartbeat build up with streaming pro... Closed
split to MOTOR-1043 Heartbeat build up with streaming pro... Closed
split to NODE-4695 Heartbeat build up with streaming pro... Closed
split to PHPLIB-1005 Heartbeat build up with streaming pro... Closed
split to PYTHON-3463 Heartbeat build up with streaming pro... Closed
split to RUBY-3151 Heartbeat build up with streaming pro... Closed
split to RUST-1500 Heartbeat build up with streaming pro... Closed
split to JAVA-4760 Heartbeat build up with streaming pro... Closed
Related
related to DRIVERS-1598 Solve for serverless/lambda connectio... Closed
related to NODE-3810 AWS Lambda: MongoDB heartbeat failure. Closed
is related to PYTHON-3186 AWS Lambda/FaaS pause and resume beha... Closed
is related to NODE-4783 find() query stucks when primary swit... Closed
Driver Changes: Not Needed
Driver Compliance:
Key Status/Resolution FixVersion
CDRIVER-4492 Won't Do
CXX-2593 Won't Do
CSHARP-4352 Won't Do
GODRIVER-2577 Fixed 1.12.0, 1.9.4, 1.10.5, 1.11.1, 1.12.0-alpha1
JAVA-4760 Won't Do
NODE-4695 Won't Do
MOTOR-1043 Won't Do
PYTHON-3463 Won't Do
PHPLIB-1005 Won't Do
RUBY-3151 Won't Do
RUST-1500 Won't Do
SWIFT-1649 Won't Do

 Description   

Summary

The SDAM Monitoring spec defines the streamable hello protocol as a way of having the server send a hello update as soon as the topology changes, or after maxAwaitTimeMS if it does not. In FaaS (Functions as a Service) environments, process execution is frozen between invocations, so the driver cannot consume the hello responses the server keeps sending every maxAwaitTimeMS. When the FaaS environment wakes up, the driver must process every heartbeat waiting on the socket before it can make progress, which causes performance delays compared to typical environments.
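For reference, a minimal TypeScript sketch of the awaitable hello a monitor issues under the streaming protocol. The field names hello, topologyVersion, and maxAwaitTimeMS come from the SDAM Monitoring spec; the surrounding shape is illustrative only, and the assumption that maxAwaitTimeMS tracks heartbeatFrequencyMS follows the description above rather than any particular driver's internals.

// Illustrative only: shape of the awaitable hello sent by a streaming monitor.
interface TopologyVersion {
  processId: unknown; // an ObjectId in real drivers
  counter: number;    // an int64 in real drivers
}

function buildAwaitableHello(lastTopologyVersion: TopologyVersion, heartbeatFrequencyMS = 10_000) {
  return {
    hello: 1,
    // Echoing the last seen topologyVersion lets the server reply early only on a change...
    topologyVersion: lastTopologyVersion,
    // ...otherwise it replies after maxAwaitTimeMS, producing a steady stream of
    // heartbeats whether or not the client process is awake to read them.
    maxAwaitTimeMS: heartbeatFrequencyMS
  };
}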

A related bug that the Node.js driver encountered specifically: the FaaS environment lets timers run to expiration between invocations while keeping execution frozen. When the FaaS environment woke up, the Node.js driver processed socket timeout errors before reading from the socket. That ordering is inherent to Node.js (timers always run first in the event loop), but it points to a potential additional issue with streaming in environments where timer execution timing cannot be considered reliable.

We solved this in Node.js by deferring timeout handling until the runtime has had a chance to read from the socket. If the read succeeds, the spurious timeout error is discarded; otherwise the timeout error is handled as usual. This required ordering could be worth encoding in the spec as part of fixing this related issue.
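A minimal sketch of that ordering, in Node.js-flavoured TypeScript with hypothetical names (MonitorConnection, handleTimeoutError) rather than the driver's actual internals: when the socket timeout fires, defer the error with setImmediate so any hello bytes already sitting in the receive buffer are delivered first, and discard the timeout if data did arrive.

import { Socket } from 'node:net';

// Hypothetical monitor connection wrapper; not the driver's real code.
class MonitorConnection {
  private sawDataSinceTimeout = false;

  constructor(private socket: Socket, heartbeatFrequencyMS: number) {
    socket.setTimeout(heartbeatFrequencyMS + 1_000);

    socket.on('data', () => {
      // A hello response arrived; any timeout raised while the process was frozen is stale.
      this.sawDataSinceTimeout = true;
      // ... parse the response and update the ServerDescription here ...
    });

    socket.on('timeout', () => {
      this.sawDataSinceTimeout = false;
      // Timers run before I/O callbacks in the Node.js event loop, so a freshly
      // thawed function can observe a spurious timeout before it reads the
      // buffered heartbeats. Deferring with setImmediate lets pending 'data'
      // events run first.
      setImmediate(() => {
        if (!this.sawDataSinceTimeout) {
          this.handleTimeoutError(new Error('monitor socket timed out'));
        }
      });
    });
  }

  private handleTimeoutError(err: Error): void {
    // Mark the server Unknown, tear down the connection, etc.
    this.socket.destroy(err);
  }
}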

Motivation

Who is the affected end user?

FaaS users.

How does this affect the end user?

Performance concerns, or an out-of-date TopologyDescription.

How likely is it that this problem or use case will occur?

Main path. The bug is not a blocker, but it occurs consistently on every invocation. The wider the gap between invocations relative to the heartbeatFrequencyMS setting, the larger the number of heartbeats that need to be processed; a rough estimate is sketched below.
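A back-of-the-envelope estimate of that backlog, assuming roughly one streamed hello response per heartbeatFrequencyMS while the process is frozen, and ignoring topology changes and the point at which TCP flow control stalls the server (the function name is illustrative, not a driver API):

function estimateHeartbeatBacklog(frozenDurationMS: number, heartbeatFrequencyMS = 10_000): number {
  return Math.floor(frozenDurationMS / heartbeatFrequencyMS);
}

// A function invoked once an hour with the default 10s heartbeat frequency wakes
// up to roughly 360 buffered hello responses to parse and apply.
console.log(estimateHeartbeatBacklog(60 * 60 * 1000)); // 360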

If the problem does occur, what are the consequences and how severe are they?

FaaS environments typically charge per execution, factoring in CPU time and memory usage, so heartbeats piling up on the socket directly affect those metrics.
The buildup is bounded by TCP flow control (eventually the send and receive buffers fill up), but it can still leave thousands of hello responses to be processed.

Is this issue urgent?

I think investigating a solution warrants "Major" (Jira) priority. There have been proposals to add a knob that forces the driver into polling mode, but that comes with its own downside (an out-of-date TopologyDescription); a sketch of that approach follows. The priority of implementing whatever solution is decided upon can be set on a per-driver basis.
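For illustration, assuming the serverMonitoringMode option later specified under DRIVERS-2578 (the ticket this one duplicates), an FaaS user could opt into polling roughly as follows; the option name and accepted values may differ by driver and version, and the trade-off remains a potentially staler TopologyDescription.

import { MongoClient } from 'mongodb';

// Assumes the serverMonitoringMode knob from DRIVERS-2578; not every driver
// version exposes it.
const client = new MongoClient(process.env.MONGODB_URI ?? 'mongodb://localhost:27017', {
  serverMonitoringMode: 'poll' // avoid streamed heartbeats piling up while frozen
});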

Is this ticket required by a downstream team?

No.

Is this ticket only for tests?

No.

