Drivers / DRIVERS-2386

Improve reliability of SDAM heartbeat error spec tests.

Details

    • Type: Improvement
    • Status: Implementing
    • Priority: Unknown
    • Resolution: Unresolved
    • None
    • Component/s: SDAM
    • None

    Description

      Summary

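      There are 3 SDAM spec tests that assert that the server monitor correctly handles check (a.k.a. heartbeat) errors after the initial handshake:

        • Command error on Monitor check
        • Network error on Monitor check
        • Network timeout on Monitor check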

      Each of those tests registers a failpoint for "hello" that causes a specific type of error. The failpoint uses "times: 2" because both the server monitor and RTT monitor send "hello" operations to the server at roughly the same interval, so it's expected that both monitors may trigger a failpoint. However, sometimes the RTT monitor runs twice and triggers the failpoint 2 times before the server monitor runs again, leading to a test failure because the server monitor heartbeat never triggers a failpoint.
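
      For illustration, a failpoint of this kind can be sketched in Go as below. This is a minimal sketch, not the spec test's actual definition: it assumes the server's "failCommand" failpoint with a "times" mode, and the error code 1234 and the helper name newHelloFailPoint are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// failPoint mirrors the general shape of a "failCommand" failpoint command.
// Only the fields needed for this illustration are modeled.
type failPoint struct {
	ConfigureFailPoint string         `json:"configureFailPoint"`
	Mode               map[string]int `json:"mode"`
	Data               failPointData  `json:"data"`
}

type failPointData struct {
	FailCommands []string `json:"failCommands"`
	ErrorCode    int      `json:"errorCode"`
}

// newHelloFailPoint (hypothetical helper) builds a failpoint that fails the
// next `times` "hello" commands with the given error code.
func newHelloFailPoint(times, errorCode int) failPoint {
	return failPoint{
		ConfigureFailPoint: "failCommand",
		Mode:               map[string]int{"times": times},
		Data: failPointData{
			FailCommands: []string{"hello"},
			ErrorCode:    errorCode,
		},
	}
}

func main() {
	// "times: 2" as described in the ticket; 1234 is a placeholder error code.
	b, err := json.MarshalIndent(newHelloFailPoint(2, 1234), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

      The failpoint does not distinguish between senders: any "hello" on a matching connection consumes one of the "times" slots, whether it came from the server monitor or the RTT monitor.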

      That can happen because the server monitor and RTT monitor run concurrently and use different timing mechanisms when awaitable "hello" is available. A server monitor using awaitable "hello" depends partially on server-side timing via "maxAwaitTimeMS" (see description here), while the RTT monitor timing is strictly driver-side (see description here). As a result, the RTT monitor can run more than once before an in-progress awaitable "hello" returns and the server monitor starts a new "hello" that would trigger the failpoint.
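
      The race can be modeled with a small deterministic replay. The sketch below is a simplification that ignores real timing: it treats the failpoint as a counter and the two monitors as an ordered list of "hello" senders. The function name and the "rtt"/"server" labels are assumptions made for this example only.

```go
package main

import "fmt"

// serverMonitorHitsFailPoint replays an ordered sequence of "hello" senders
// against a failpoint that fails the first `times` commands, and reports
// whether at least one of those failures landed on the server monitor.
func serverMonitorHitsFailPoint(times int, senders []string) bool {
	remaining := times
	for _, sender := range senders {
		if remaining == 0 {
			// Failpoint exhausted; later commands succeed.
			return false
		}
		remaining--
		if sender == "server" {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical interleaving: the awaitable "hello" is still in progress
	// while the RTT monitor sends two "hello" commands.
	order := []string{"rtt", "rtt", "server"}

	fmt.Println(serverMonitorHitsFailPoint(2, order)) // false: both slots used by RTT
	fmt.Println(serverMonitorHitsFailPoint(4, order)) // true: a slot is left for the server monitor
}
```

      With a budget of 2, this ordering leaves nothing for the server monitor, which matches the failure described above.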

      We can significantly reduce the probability of intermittent failures by increasing the number of times the failpoint can be triggered. If we do that, we also need to remove the assertions that exactly 1 "ServerMarkedUnknownEvent" and exactly 1 "PoolClearedEvent" are fired (already done in the "Network error on Monitor check" spec test), because the server monitor would be more likely to trigger the failpoint more than once.
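
      The relaxed assertion amounts to checking for "at least one" event rather than "exactly one". A minimal sketch, assuming captured events are available as a list of event names (the helper names below are hypothetical and not part of any driver's test runner):

```go
package main

import "fmt"

// countEvents returns how many captured events have the given name.
func countEvents(events []string, name string) int {
	n := 0
	for _, e := range events {
		if e == name {
			n++
		}
	}
	return n
}

// checkAtLeast returns an error unless at least `minimum` events with the
// given name were captured. Extra events are tolerated.
func checkAtLeast(events []string, name string, minimum int) error {
	if got := countEvents(events, name); got < minimum {
		return fmt.Errorf("expected at least %d %s, got %d", minimum, name, got)
	}
	return nil
}

func main() {
	// Hypothetical capture: the server monitor triggered the failpoint twice.
	captured := []string{
		"ServerMarkedUnknownEvent", "PoolClearedEvent",
		"ServerMarkedUnknownEvent", "PoolClearedEvent",
	}
	for _, name := range []string{"ServerMarkedUnknownEvent", "PoolClearedEvent"} {
		if err := checkAtLeast(captured, name, 1); err != nil {
			fmt.Println("FAIL:", err)
			continue
		}
		fmt.Println("ok:", name)
	}
}
```

      An exact-count check would reject this capture even though the monitor handled each error correctly.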

      Motivation

      Who is the affected end user?

      DBX devs.

      How does this affect the end user?

      The SDAM "Command error on Monitor check", "Network error on Monitor check", and "Network timeout on Monitor check" spec tests fail intermittently.

      How likely is it that this problem or use case will occur?

      The failure is caused by a race between the server monitor heartbeat loop and the RTT monitor loop. Depending on runtime conditions, the RTT monitor loop may run twice before the server monitor heartbeat loop runs once after the failpoint is registered. The observed failure rate in the Go Driver is around 5-10% if run individually, or around 1-2% if run with the rest of the test suite.

      If the problem does occur, what are the consequences and how severe are they?

      Pull request or waterfall Evergreen CI test runs may fail intermittently, leading to "false positive" test failures that create confusion, take time to troubleshoot, and possibly hide actual errors that are misinterpreted as errors due to a flaky test.

      Is this issue urgent?

      No.

      Is this ticket required by a downstream team?

      No.

      Is this ticket only for tests?

      Yes.

            People

              matt.dale@mongodb.com Matt Dale