[DRIVERS-2458] Add test that monitors do not create excessive connections during quiesce mode Created: 30/Sep/22  Updated: 26/Jun/23

Status: Backlog
Project: Drivers
Component/s: SDAM
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Patrick Freed Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to JAVA-4743 High monitor connection churn during ... Closed
Driver Changes: Needed

 Description   

Summary

We currently lack tests for the behavior of the driver when a server is shutting down / in quiesce mode. In particular, we lack a case verifying that driver monitors do not repeatedly attempt to create new connections to the server when it is in quiesce mode. This bug is easy to introduce, and has been observed in the Java driver already (see JAVA-4743 and HELP-37852). We should add a test case for this to ensure other drivers are unaffected.

Motivation

Who is the affected end user?

Driver authors, and potentially users if a bug is discovered.

How does this affect the end user?

If a driver is affected by this bug, it will create and close a high number of connections for the duration of the quiesce period. This can contribute to connection storms and cluster instability.

One customer was affected by this bug in the Java driver. See the above HELP ticket.

How likely is it that this problem or use case will occur?

If the driver is affected by this bug, then it will occur any time the server enters quiesce mode (i.e. every time it shuts down, so most planned and unplanned maintenance events).

If the problem does occur, what are the consequences and how severe are they?

The driver will open a large number of connections. The consequences may vary, but this could degrade application performance, degrade network performance, and potentially crash the quiescing node.

Is this issue urgent?

Fixing the bug is urgent if a driver is affected by it. The test helps identify whether or not that is the case.

Is this ticket required by a downstream team?

No

Is this ticket only for tests?

Yes, unless a bug is discovered.

An example prose test could enable a fail point on hello that returns a ShutdownInProgress error, create a client with heartbeatFrequencyMS = 500, sleep for 2 seconds, and then assert that the number of observed heartbeatFailedEvents is between 3 and 5.
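A minimal Python sketch of the two driver-independent pieces of that prose test: the fail point document and the bounds check. It assumes the standard failCommand fail point and error code 91 for ShutdownInProgress. The helper names, the appName value, the inclusion of the legacy isMaster command, and the reading of "between 3 and 5" as "2000 / 500 = 4 checks, plus or minus one" are illustrative assumptions, not part of the ticket.

```python
SHUTDOWN_IN_PROGRESS = 91  # server error code for ShutdownInProgress


def build_quiesce_failpoint(app_name):
    """Build a failCommand document that fails every hello sent by app_name.

    Hypothetical helper. "isMaster" is listed alongside "hello" as an
    assumption, to also catch drivers that send the legacy command name.
    """
    return {
        "configureFailPoint": "failCommand",
        "mode": "alwaysOn",
        "data": {
            "failCommands": ["hello", "isMaster"],
            "errorCode": SHUTDOWN_IN_PROGRESS,
            "appName": app_name,
        },
    }


def expected_failure_bounds(sleep_ms, heartbeat_frequency_ms, tolerance=1):
    """Return (low, high) bounds on failed heartbeats for a well-behaved monitor.

    Assumes roughly one check per heartbeat interval, with a tolerance of
    one check in either direction to absorb timing jitter.
    """
    nominal = sleep_ms // heartbeat_frequency_ms
    return nominal - tolerance, nominal + tolerance


def failed_heartbeats_ok(observed, sleep_ms=2000, heartbeat_frequency_ms=500):
    """True if the observed failure count is consistent with no connection churn."""
    low, high = expected_failure_bounds(sleep_ms, heartbeat_frequency_ms)
    return low <= observed <= high
```

In a real driver test, the document would be sent to the admin database before creating the monitored client, and `observed` would come from counting heartbeat-failed events in a monitoring listener. A driver affected by the bug would report a count far above the upper bound.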



 Comments   
Comment by Neal Beeken [ 06/Oct/22 ]

Triage Notes: we have a similar test:
https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring-tests.rst#monitors-sleep-at-least-minheartbeatfrequencyms-between-checks
This would cover this case if we just add a skip condition to the fail point so the first few hellos succeed and then start failing. We should use the quiesce mode error code to target this case but any error code would work.
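
A rough Python sketch of the skip variant suggested above, assuming the failCommand fail point accepts a `{"skip": N}` mode that leaves the first N matching commands untouched and fails every one after that. The helper names, the default appName, and the skip count used in the usage note are hypothetical.

```python
def build_failpoint_with_skip(skip_count, error_code=91, app_name="quiesceMonitorTest"):
    """Build a failCommand document whose first skip_count hellos succeed.

    Hypothetical helper. error_code defaults to 91 (ShutdownInProgress) to
    target the quiesce case, but any error code can be passed instead.
    """
    return {
        "configureFailPoint": "failCommand",
        "mode": {"skip": skip_count},
        "data": {
            "failCommands": ["hello", "isMaster"],
            "errorCode": error_code,
            "appName": app_name,
        },
    }


def hello_outcomes(skip_count, total_hellos):
    """Model which hellos succeed or fail under the skip mode described above."""
    return ["ok" if i < skip_count else "fail" for i in range(total_hellos)]
```

For example, `hello_outcomes(2, 5)` models a monitor whose first two checks succeed, so the server is discovered normally, and whose later checks all fail as they would once the server enters quiesce mode.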
