Type: Bug
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Component/s: FaaS, SDAM
Labels:
None

Driver Changes:
Not Needed

Driver Compliance:


Key	Status/Resolution	FixVersion
CDRIVER-4492	Won't Do
CXX-2593	Won't Do
CSHARP-4352	Won't Do
GODRIVER-2577	Fixed	1.12.0, 1.9.4, 1.10.5, 1.11.1, 1.12.0-alpha1
JAVA-4760	Won't Do
NODE-4695	Won't Do
MOTOR-1043	Won't Do
PYTHON-3463	Won't Do
PHPLIB-1005	Won't Do
RUBY-3151	Won't Do
RUST-1500	Won't Do
SWIFT-1649	Won't Do


Summary

The SDAM Monitoring spec defines the streamable hello protocol as a way of having the server send a hello response as soon as there is a change, or once maxAwaitTimeMS elapses without one. In FAAS (functions as a service) environments, process execution is frozen between invocations, so the driver cannot consume the hello responses that the server keeps sending every maxAwaitTimeMS. When the FAAS environment wakes up, the driver must process every heartbeat that is waiting on the socket to be read. This causes performance delays compared to typical environments.

An associated bug that the Node.js driver encountered was that the FAAS environment allows timers to keep running until expiration between invocations while execution stays frozen. Once the FAAS environment woke up, the Node.js driver processed socket timeout errors before reading from the socket. This order of operations is inherent to the Node.js environment, where timers always come first in the event loop, but it may indicate an additional issue with streaming in any environment where the time at which a timeout executes cannot be considered reliable.

We were able to solve this issue in Node.js by ensuring that timeout errors are handled only after the runtime has been allowed to read from the socket. If the read succeeds, the erroneous timeout error is cleared; otherwise, the timeout error is handled as normal. This required ordering could be worth encoding in the spec as part of fixing this related issue.
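The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the Node.js driver's actual code: `DeferredTimeout`, `Defer`, and `Outcome` are hypothetical names, and the `defer` callback stands in for whatever mechanism lets the runtime attempt a socket read before the timeout is acted on.

```typescript
// Hypothetical sketch; names are illustrative and not taken from the Node.js driver.
type Outcome = "pending" | "data" | "timeout";
type Defer = (fn: () => void) => void;

class DeferredTimeout {
  private state: Outcome = "pending";

  // `defer` must schedule its callback to run after the runtime has had a
  // chance to read from the socket.
  constructor(private readonly defer: Defer) {}

  // Called when the socket timeout timer fires.
  onTimerFired(): void {
    if (this.state !== "pending") return;
    // Do not fail yet: let a pending socket read run first.
    this.defer(() => {
      // Only report the timeout if no data arrived in the meantime.
      if (this.state === "pending") this.state = "timeout";
    });
  }

  // Called when bytes are read from the socket.
  onData(): void {
    // A successful read clears the not-yet-handled timeout.
    if (this.state === "pending") this.state = "data";
  }

  get outcome(): Outcome {
    return this.state;
  }
}
```

In Node.js, one candidate for `defer` is `setImmediate`, since its callbacks run after the I/O polling phase that follows the timers phase. Whether that matches the fix that was shipped is an assumption here.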

Motivation

Who is the affected end user?

FAAS users.

How does this affect the end user?

Performance concerns, or an out-of-date TopologyDescription.

How likely is it that this problem or use case will occur?

Main path. The bug is not a blocker, but it will occur consistently on every invocation. The wider the gap between invocations relative to the heartbeatFrequencyMS setting, the larger the number of heartbeats that need to be processed.
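As a back-of-the-envelope illustration of that relationship, the number of queued responses can be estimated as below. This is a sketch under stated assumptions: `estimateQueuedHeartbeats` is a hypothetical helper, it assumes one hello response per heartbeat interval per monitored server, and it ignores the TCP buffer limits that cap the real number.

```typescript
// Rough upper bound on hello responses waiting to be read after a freeze.
// Assumes one response per heartbeat interval per monitored server and
// ignores the TCP buffer limits that cap the real number.
function estimateQueuedHeartbeats(
  frozenMs: number,
  heartbeatFrequencyMS: number,
  monitoredServers: number = 1,
): number {
  if (heartbeatFrequencyMS <= 0) {
    throw new RangeError("heartbeatFrequencyMS must be positive");
  }
  // Negative gaps are treated as no gap at all.
  const perServer = Math.floor(Math.max(0, frozenMs) / heartbeatFrequencyMS);
  return perServer * monitoredServers;
}

// One hour frozen, a 10-second interval, three monitored servers.
const queued = estimateQueuedHeartbeats(60 * 60 * 1000, 10_000, 3);
console.log(queued); // 1080
```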

If the problem does occur, what are the consequences and how severe are they?

FAAS environments are usually designed around charging per execution, factoring in CPU time and memory usage. The common potential for heartbeats to pile up on the socket has an impact on these metrics.
The issue is mitigated by the limits imposed by TCP flow control (eventually the send and receive buffers fill up), but it can still result in thousands of hello responses needing to be processed.

Is this issue urgent?

I think investigating a solution has "Major" (from JIRA) priority. There have been some proposals to add a knob that forces the driver into polling mode, but that comes with its own downsides (an out-of-date TopologyDescription). The priority of implementing the chosen solution can be decided on a per-driver basis.
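For discussion, the proposed knob could look roughly like the sketch below. Everything here is hypothetical: `chooseMonitoringMode`, the `forcePolling` flag, and the default list of environment-variable markers are illustrative assumptions, not an existing driver option or a statement of how any driver detects a FAAS runtime.

```typescript
type MonitoringMode = "stream" | "poll";

// Hypothetical default: an example of a variable a FAAS runtime might set.
// The list is an assumption made for illustration only.
const EXAMPLE_FAAS_MARKERS: readonly string[] = ["AWS_LAMBDA_RUNTIME_API"];

function chooseMonitoringMode(
  env: Record<string, string | undefined>,
  forcePolling: boolean = false,
  markers: readonly string[] = EXAMPLE_FAAS_MARKERS,
): MonitoringMode {
  // An explicit user setting always wins.
  if (forcePolling) return "poll";
  // Otherwise fall back to polling only when a marker variable is set and non-empty.
  const inFaas = markers.some((name) => (env[name] ?? "") !== "");
  return inFaas ? "poll" : "stream";
}
```

Polling avoids the build-up because the driver only receives a response when it asks for one, at the cost of noticing topology changes later.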

Is this ticket required by a downstream team?

No.

Is this ticket only for tests?

No.

duplicates

DRIVERS-2578 Switch to polling monitoring when running within a FaaS environment

Implementing

is related to

PYTHON-3186 AWS Lambda/FaaS pause and resume behavior causes SDAM heartbeats to timeout

Closed

NODE-4783 find() query stucks when primary switches back after stepDown() period is finished

Closed

related to

NODE-3810 AWS Lambda: MongoDB heartbeat failure.

Closed

DRIVERS-1598 Solve for serverless/lambda connection pool issues

Development Complete

split to

CDRIVER-4492 Heartbeat build up with streaming protocol when driver process is stopped (FAAS)

Closed

CSHARP-4352 Heartbeat build up with streaming protocol when driver process is stopped (FAAS)

Closed

CXX-2593 Heartbeat build up with streaming protocol when driver process is stopped (FAAS)

Closed

GODRIVER-2577 Heartbeat build up with streaming protocol when driver process is stopped (FAAS)

Closed

MOTOR-1043 Heartbeat build up with streaming protocol when driver process is stopped (FAAS)

Closed

NODE-4695 Heartbeat build up with streaming protocol when driver process is stopped (FAAS)

Closed

PHPLIB-1005 Heartbeat build up with streaming protocol when driver process is stopped (FAAS)

Closed

PYTHON-3463 Heartbeat build up with streaming protocol when driver process is stopped (FAAS)

Closed

RUBY-3151 Heartbeat build up with streaming protocol when driver process is stopped (FAAS)

Closed

RUST-1500 Heartbeat build up with streaming protocol when driver process is stopped (FAAS)

Closed

JAVA-4760 Heartbeat build up with streaming protocol when driver process is stopped (FAAS)

Closed

(11 split to)

Assignee: Unassigned
Reporter: Neal Beeken
Votes: 0
Watchers: 16

Created: Mar 25 2022 08:32:55 PM UTC
Updated: Mar 24 2023 01:21:23 PM UTC
Resolved: Mar 24 2023 01:21:23 PM UTC
