[DRIVERS-2386] Improve reliability of SDAM heartbeat error spec tests. Created: 08/Jul/22  Updated: 04/Aug/22

Status: Implementing
Project: Drivers
Component/s: SDAM
Fix Version/s: None

Type: Improvement Priority: Unknown
Reporter: Matt Dale Assignee: Matt Dale
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Issue split
split to CDRIVER-4426 Improve reliability of SDAM heartbeat... Backlog
split to CXX-2544 Improve reliability of SDAM heartbeat... Backlog
split to NODE-4504 Improve reliability of SDAM heartbeat... Backlog
split to CSHARP-4252 Improve reliability of SDAM heartbeat... Closed
split to GODRIVER-2490 Improve reliability of SDAM heartbeat... Closed
split to JAVA-4677 Improve reliability of SDAM heartbeat... Closed
split to MOTOR-993 Improve reliability of SDAM heartbeat... Closed
split to NODE-4414 Improve reliability of SDAM heartbeat... Closed
split to PHPLIB-910 Improve reliability of SDAM heartbeat... Closed
split to PYTHON-3353 Improve reliability of SDAM heartbeat... Closed
split to RUBY-3050 Improve reliability of SDAM heartbeat... Closed
split to RUST-1407 Improve reliability of SDAM heartbeat... Closed
Related
is related to GODRIVER-2464 Add timeout for RTT monitor "hello" o... Closed
Driver Changes: Needed
Downstream Changes Summary:

Sync SDAM integration spec tests at revision 98e20daa.

Driver Compliance:
Key            Status/Resolution  FixVersion
CDRIVER-4426   Backlog
CXX-2544       Backlog
CSHARP-4252    Done               2.18.0
GODRIVER-2490  Done
JAVA-4677      Fixed              4.7.0
NODE-4414      Fixed              4.9.0
MOTOR-993      Duplicate
PYTHON-3353    Fixed              4.2
PHPLIB-910     Won't Do
RUBY-3050      Fixed              2.18.1
RUST-1407      Duplicate
SWIFT-1601     Duplicate
NODE-4504      Backlog

 Description   

Summary

There are 3 SDAM spec tests that assert that the server monitor correctly handles check (a.k.a. heartbeat) errors after the initial handshake:

Command error on Monitor check
Network error on Monitor check
Network timeout on Monitor check

Each of those tests registers a failpoint for "hello" that causes a specific type of error. The failpoint uses "times: 2" because both the server monitor and RTT monitor send "hello" operations to the server at roughly the same interval, so it's expected that both monitors may trigger a failpoint. However, sometimes the RTT monitor runs twice and triggers the failpoint 2 times before the server monitor runs again, leading to a test failure because the server monitor heartbeat never triggers a failpoint.
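
For concreteness, a failpoint of this kind can be written as a "configureFailPoint" command document. The sketch below builds one as a plain Python dict. The helper name build_hello_failpoint, the "appName" value, the error code, and the inclusion of the legacy "isMaster" command name are illustrative assumptions, not values copied from the spec tests.

```python
def build_hello_failpoint(times, error_code=1234, app_name="exampleTest"):
    """Build a failCommand failpoint document for "hello" (hypothetical helper).

    `times` is how many matching commands the server fails before the
    failpoint turns itself off.
    """
    return {
        "configureFailPoint": "failCommand",
        "mode": {"times": times},
        "data": {
            # Assumption: also match the legacy command name.
            "failCommands": ["hello", "isMaster"],
            # Assumption: illustrative app name and error code.
            "appName": app_name,
            "errorCode": error_code,
        },
    }


# "times: 2" as described above: one trigger intended for each monitor.
failpoint = build_hello_failpoint(times=2)
```

With a driver client, a document like this would typically be sent to the "admin" database as a command. Only the first "times" matching commands fail, regardless of which monitor sent them.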

That can happen because the server monitor and RTT monitor run concurrently and use different timing mechanisms when awaitable "hello" is available. When using awaitable "hello", the server monitor's timing depends partially on server-side timing via "maxAwaitTimeMS" (see description here), while the RTT monitor's timing is strictly driver-side (see description here). As a result, the RTT monitor can run more than once before an in-progress awaitable "hello" returns and the server monitor starts a new "hello" that would trigger the failpoint.
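
The race can be illustrated with a small deterministic model, independent of any driver. This is a sketch under simplifying assumptions: the failpoint is modeled as a counter, and the function name and the "rtt"/"server" labels are hypothetical.

```python
def monitors_hit_by_failpoint(hello_order, times):
    """Return the set of monitors whose "hello" consumed a failpoint trigger.

    hello_order lists which monitor sent each "hello", in arrival order.
    The failpoint fails only the first `times` commands it sees.
    """
    remaining = times
    hit = set()
    for monitor in hello_order:
        if remaining == 0:
            break
        hit.add(monitor)
        remaining -= 1
    return hit


# The server monitor's awaitable "hello" is still pending while the RTT
# monitor sends two of its own, so the RTT monitor uses up both triggers.
order = ["rtt", "rtt", "server", "rtt", "server"]
print(monitors_hit_by_failpoint(order, times=2))  # {'rtt'}
```

In this ordering the server monitor never observes an error, which is the failing case the tests hit intermittently.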

We can significantly reduce the probability of intermittent failures by increasing the number of times the failpoint can be triggered. If we do that, we also need to remove the assertions that exactly 1 "ServerMarkedUnknownEvent" and exactly 1 "PoolClearedEvent" are fired (already done in the "Network error on Monitor check" spec test), because the server monitor would then be more likely to trigger the failpoint more than once.
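
A minimal sketch of both halves of this proposal, using a simplified model rather than real driver code. The trigger count of 4, the function names, and the example "hello" ordering are assumptions chosen for illustration.

```python
def server_monitor_sees_error(hello_order, times):
    """True if a server-monitor "hello" is among the first `times` commands."""
    return "server" in hello_order[:times]


def events_ok_relaxed(events):
    """Relaxed check: at least one of each event, rather than exactly one."""
    required = ("ServerMarkedUnknownEvent", "PoolClearedEvent")
    return all(events.count(name) >= 1 for name in required)


# Hypothetical arrival order in which the RTT monitor runs twice first.
order = ["rtt", "rtt", "server", "rtt", "server"]
fails_with_2 = not server_monitor_sees_error(order, times=2)
passes_with_4 = server_monitor_sees_error(order, times=4)

# With more triggers available, the server monitor may fail more than once,
# so each event can fire more than once and an "exactly 1" check would break.
observed = [
    "ServerMarkedUnknownEvent",
    "PoolClearedEvent",
    "ServerMarkedUnknownEvent",
    "PoolClearedEvent",
]
relaxed_passes = events_ok_relaxed(observed)
```

Raising the trigger count does not remove the race; it only makes it less likely that the RTT monitor consumes every trigger before the server monitor's next "hello".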

Motivation

Who is the affected end user?

DBX devs.

How does this affect the end user?

The SDAM "Command error on Monitor check", "Network error on Monitor check", and "Network timeout on Monitor check" spec tests fail intermittently.

How likely is it that this problem or use case will occur?

The failure is caused by a race between the server monitor heartbeat loop and the RTT monitor loop. Depending on runtime conditions, the RTT monitor loop may run twice before the server monitor heartbeat loop runs once after the failpoint is registered. The observed failure rate in the Go Driver is around 5-10% if run individually, or around 1-2% if run with the rest of the test suite.

If the problem does occur, what are the consequences and how severe are they?

Pull request or waterfall Evergreen CI test runs may fail intermittently, leading to "false positive" test failures that create confusion, take time to troubleshoot, and possibly hide actual errors that are misinterpreted as errors due to a flaky test.

Is this issue urgent?

No.

Is this ticket required by a downstream team?

No.

Is this ticket only for tests?

Yes.



 Comments   
Comment by Githook User [ 13/Jul/22 ]

Author:

{'name': 'Matt Dale', 'email': '9760375+matthewdale@users.noreply.github.com', 'username': 'matthewdale'}

Message: DRIVERS-2386 Improve reliability of SDAM heartbeat error spec tests. (#1272)
Branch: master
https://github.com/mongodb/specifications/commit/98e20daa7e1bdf8f4acf391376531c477990bd3c

Comment by Matt Dale [ 08/Jul/22 ]

Proposed test changes: https://github.com/mongodb/specifications/pull/1272
Tested spec test changes in the Go Driver: https://github.com/mongodb/mongo-go-driver/pull/1018

Generated at Thu Feb 08 08:25:27 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.