Uploaded image for project: 'Drivers'
  1. Drivers
  2. DRIVERS-2386

Improve reliability of SDAM heartbeat error spec tests.

    • Type: Icon: Improvement Improvement
    • Resolution: Unresolved
    • Priority: Icon: Unknown Unknown
    • None
    • Component/s: SDAM
    • Labels:
      None
    • $i18n.getText("admin.common.words.hide")
      Key Status/Resolution FixVersion
      CDRIVER-4426 Backlog
      CXX-2544 Backlog
      CSHARP-4252 Done 2.18.0
      GODRIVER-2490 Done
      JAVA-4677 Fixed 4.7.0
      NODE-4414 Fixed 4.9.0
      MOTOR-993 Duplicate
      PYTHON-3353 Fixed 4.2
      PHPLIB-910 Won't Do
      RUBY-3050 Fixed 2.18.1
      RUST-1407 Duplicate
      SWIFT-1601 Duplicate
      NODE-4504 Backlog
      $i18n.getText("admin.common.words.show")
      #scriptField, #scriptField *{ border: 1px solid black; } #scriptField{ border-collapse: collapse; } #scriptField td { text-align: center; /* Center-align text in table cells */ } #scriptField td.key { text-align: left; /* Left-align text in the Key column */ } #scriptField a { text-decoration: none; /* Remove underlines from links */ border: none; /* Remove border from links */ } /* Add green background color to cells with FixVersion */ #scriptField td.hasFixVersion { background-color: #00FF00; /* Green color code */ } /* Center-align the first row headers */ #scriptField th { text-align: center; } Key Status/Resolution FixVersion CDRIVER-4426 Backlog CXX-2544 Backlog CSHARP-4252 Done 2.18.0 GODRIVER-2490 Done JAVA-4677 Fixed 4.7.0 NODE-4414 Fixed 4.9.0 MOTOR-993 Duplicate PYTHON-3353 Fixed 4.2 PHPLIB-910 Won't Do RUBY-3050 Fixed 2.18.1 RUST-1407 Duplicate SWIFT-1601 Duplicate NODE-4504 Backlog

      Summary

      There are 3 SDAM spec tests that assert that the server monitor handles check (aka. heartbeat) errors after the initial handshake correctly:

      Each of those tests registers a failpoint for "hello" that causes a specific type of error. The failpoint uses "times: 2" because both the server monitor and RTT monitor send "hello" operations to the server at roughly the same interval, so it's expected that both monitors may trigger a failpoint. However, sometimes the RTT monitor runs twice and triggers the failpoint 2 times before the server monitor runs again, leading to a test failure because the server monitor heartbeat never triggers a failpoint.

      That can happen because the server monitor and RTT monitor run concurrently and use different timing mechanisms when awaitable "hello" is available. The server monitor using awaitable "hello" depends partially on server-side timing via "maxAwaitTimeMS" (see description here), while the RTT monitor timing is strictly driver-side (see description here). As a result, it's possible for the RTT monitor to run more than once before an in-progress awaitable "hello" returns and attempts to start a new "hello" that would trigger the failpoint.

      We can significantly reduce the probability of intermittent failures by increasing the number of times the failpoint can be triggered. If we do that, we also need to remove the assertion that exactly 1 "ServerMarkedUnknownEvent" and "PoolClearedEvent" events are fired (already done in the "Network error on Monitor check" spec test) because the server monitor would have a higher probability of triggering more than 1 failpoint.

      Motivation

      Who is the affected end user?

      DBX devs.

      How does this affect the end user?

      The SDAM Command error on Monitor check, Network error on Monitor check, or Network timeout on Monitor check spec tests fail intermittently.

      How likely is it that this problem or use case will occur?

      The failure is caused by a race between the server monitor heartbeat loop and the RTT monitor loop. Depending on runtime conditions, the RTT monitor loop may run twice before the server monitor heartbeat loop runs once after the failpoint is registered. The observed failure rate in the Go Driver is around 5-10% if run individually, or around 1-2% if run with the rest of the test suite.

      If the problem does occur, what are the consequences and how severe are they?

      Pull request or waterfall Evergreen CI test runs may fail intermittently, leading to "false positive" test failures that create confusion, take time to troubleshoot, and possibly hide actual errors that are misinterpreted as errors due to a flaky test.

      Is this issue urgent?

      No.

      Is this ticket required by a downstream team?

      No.

      Is this ticket only for tests?

      Yes.

            Assignee:
            matt.dale@mongodb.com Matt Dale
            Reporter:
            matt.dale@mongodb.com Matt Dale
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated: