DRIVERS-2004

Introduce prose concurrent stress tests for the Connection Pool

    • Type: Spec Change
    • Resolution: Unresolved
    • Priority: Unknown
    • None
    • Component/s: CMAP
    • None
    • Needed
      Key Status/Resolution FixVersion
      CDRIVER-4509 Blocked
      CXX-2603 Blocked
      CSHARP-4372 Blocked
      GODRIVER-2593 Blocked
      JAVA-4786 Blocked
      NODE-4740 Blocked
      MOTOR-1052 Blocked
      PYTHON-3483 Blocked
      PHPLIB-1026 Won't Do
      RUBY-3162 Blocked
      RUST-1514 Blocked
      SWIFT-1664 Won't Do

      Summary

      The existing specification tests do not adequately test the Connection Pool under intensive concurrent usage. Not only do they use the Connection Pool lightly and not necessarily concurrently, they also aim for the same predictable execution every time in order to observe a specific set of events in a specific order. Concurrent stress tests take a different approach: they subject the object under test to intensive, variable usage in order to produce many different execution paths, including otherwise rare ones. This is possible because the assertions in concurrent stress tests are simpler than those in usual tests: they usually expect the absence of unexpected behaviors, e.g., exceptions or deadlocks, and potentially compliance with very basic specified guarantees, like not returning null values from non-null methods.

      Motivation

      Who is the affected end user?

      Driver engineers.

      How does this affect the end user?

      Having such tests increases the probability of a driver engineer discovering a bug in the Connection Pool before releasing the changes to driver users.

      How likely is it that this problem or use case will occur?

      It depends on the complexity of the changes to the Connection Pool. While working on "Avoiding connection storms" (JAVA-3890), the concurrent tests allowed me to discover and fix multiple bugs that were not caught by the existing specification tests. boris.dogadov reported a similar experience.

      If the problem does occur, what are the consequences and how severe are they?

      If we miss a concurrency bug in the Connection Pool, it may cause serious problems in the application that uses the driver, e.g., deadlocks or memory leaks. It is impossible to be more specific.

      Is this issue urgent?

      No.

      Is this ticket required by a downstream team?

      No.

      Is this ticket only for tests?

      Yes.

      Details

      Below is an overview of what the Java driver's concurrent stress tests do, limited to the parts that are not specific to the Java driver. I think that the description of these tests should be quite vague, allowing individual drivers to create tests that are more appropriate for them and have a higher chance of discovering bugs. Even implementations of simpler specification prose tests sometimes differ from the prose description, at least in the Java driver.

      Concurrent usage stress test.

      Create a pool with various values of minPoolSize/maxPoolSize, maxIdleTimeMS, and any other non-standard options that your pool may have. Using extreme option values may be helpful. Use the pool concurrently (checkOut/checkIn, synchronously/asynchronously) with different numbers of concurrent users. While the pool is in use, spontaneously invalidate it (clear followed by ready), spontaneously clear it, or spontaneously mark it ready; vary the probability of such disturbances across executions.

      Expectations:

      • If any action fails, including failing with a timeout but excluding PoolClearedError, then the test fails.
      • If the test hangs, then it fails. This implies that the test has an adequate timeout.
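
      A minimal sketch of how such a test could be structured is shown below. It is not the Java driver's test: ToyPool, PoolClearedException and run are hypothetical names introduced for illustration, and the toy pool only models connections as semaphore permits. A real test would drive the driver's actual pool and also vary options such as minPoolSize and maxIdleTimeMS.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

final class PoolStressSketch {
    /** Hypothetical stand-in for PoolClearedError; not a driver class. */
    static final class PoolClearedException extends RuntimeException {}

    /** Hypothetical toy pool: permits model connections, `paused` models the cleared state. */
    static final class ToyPool {
        private final Semaphore permits;
        private volatile boolean paused;

        ToyPool(int maxPoolSize) { permits = new Semaphore(maxPoolSize, true); }

        void checkOut(long timeoutMs) throws InterruptedException, TimeoutException {
            if (paused) throw new PoolClearedException();
            if (!permits.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS)) throw new TimeoutException();
        }
        void checkIn() { permits.release(); }
        void clear() { paused = true; }
        void ready() { paused = false; }
    }

    /** Returns the number of unexpected failures; throws AssertionError if the test hangs. */
    static int run(int maxPoolSize, int users, double disturbanceProbability, long durationMs) {
        ToyPool pool = new ToyPool(maxPoolSize);
        AtomicInteger failures = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(users);
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(durationMs);
        for (int i = 0; i < users; i++) {
            executor.submit(() -> {
                ThreadLocalRandom random = ThreadLocalRandom.current();
                while (System.nanoTime() < deadline) {
                    try {
                        if (random.nextDouble() < disturbanceProbability) {
                            switch (random.nextInt(3)) {
                                case 0: pool.clear(); pool.ready(); break; // invalidate
                                case 1: pool.clear(); break;
                                default: pool.ready();
                            }
                        }
                        pool.checkOut(1000);
                        pool.checkIn();
                    } catch (PoolClearedException expected) {
                        // The only failure the test tolerates.
                    } catch (Throwable unexpected) {
                        failures.incrementAndGet(); // includes checkOut timeouts
                    }
                }
            });
        }
        executor.shutdown();
        try {
            // The test's own timeout: a hang surfaces as a failed await.
            if (!executor.awaitTermination(durationMs + 5000, TimeUnit.MILLISECONDS)) {
                throw new AssertionError("stress test hung");
            }
        } catch (InterruptedException e) {
            throw new AssertionError(e);
        } finally {
            executor.shutdownNow();
        }
        return failures.get();
    }

    public static void main(String[] args) {
        // Vary pool size, user count and disturbance probability across executions.
        int failures = run(1, 4, 0.0, 200) + run(2, 16, 0.1, 200) + run(8, 64, 0.5, 200);
        if (failures != 0) throw new AssertionError(failures + " unexpected failures");
    }
}
```

      The worker loop deliberately asserts very little: it only counts failures other than the pool-cleared one, and the bounded awaitTermination call turns a hang into a test failure.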

      I cannot stress enough how helpful it is to have assertions (checks of expectations that may be violated if and only if the driver code is incorrect) in the Connection Pool code itself.
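
      As an illustration of such in-pool assertions, here is a small sketch, assuming a pool that tracks checked-out and idle connection counts. The class, its fields and the assertTrue helper are all hypothetical and are not taken from any driver. The helper throws unconditionally, unlike the Java assert keyword, which only fires when the JVM runs with -ea.

```java
final class PoolInvariantsSketch {
    private final int maxPoolSize;
    private int checkedOut; // guarded by this
    private int idle;       // guarded by this

    PoolInvariantsSketch(int maxPoolSize) { this.maxPoolSize = maxPoolSize; }

    /** Hypothetical helper: fails loudly whenever an invariant is violated. */
    private static void assertTrue(boolean condition, String message) {
        if (!condition) throw new AssertionError(message);
    }

    /** Bookkeeping for a checkOut that either reuses an idle connection or opens a new one. */
    synchronized void onCheckOut(boolean reusedIdle) {
        if (reusedIdle) {
            assertTrue(idle > 0, "reused an idle connection that does not exist");
            idle--;
        }
        checkedOut++;
        assertTrue(checkedOut + idle <= maxPoolSize, "pool exceeded maxPoolSize");
    }

    /** Bookkeeping for a checkIn; the connection becomes idle. */
    synchronized void onCheckIn() {
        assertTrue(checkedOut > 0, "checkIn without a matching checkOut");
        checkedOut--;
        idle++;
    }
}
```

      With checks like these in place, a stress test does not need to know which invariant broke: any AssertionError thrown inside the pool surfaces as a failed action.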

      Hand-over mechanism concurrent stress test.

      While this test may not be relevant to some drivers, I know that it is relevant to some others besides the Java driver. The specification states "the Pool MUST NOT service any newer checkOut requests before fulfilling the original one which could not be fulfilled". This fairness requirement means that a checked-in connection must become available only to the checkOut request that has been waiting the longest. I refer to the mechanism that achieves such behavior as the hand-over mechanism. In some drivers, including the Java driver, it adds enough complexity to warrant additional testing.

      Create a pool with connections not expiring and no background thread populating it. The maxPoolSize must be equal to maxConnecting + openConnectionsCount + wiggleCount. The meaning of the last two terms will soon become clear.

      1. Check out openConnectionsCount connections.
      2. Initiate checkOut of maxConnecting connections and ensure that they are stuck indefinitely trying to be established. The Java driver does this by using a fake connection implementation.
      3. As a result of the stuck connections, no new connection may be established, and the only way to check out a connection is to have one checked in first.
      4. Start checking in the openConnectionsCount connections that were previously checked out. Concurrently, start checking out openConnectionsCount connections.
      5. If the hand-over mechanism works, the test will be able to successfully check out openConnectionsCount connections without the pool creating any connections.
      6. wiggleCount is needed to open opportunities to create new connections, so that the test can then check that no connections were created nonetheless.
      7. In order to make the most of wiggleCount, the test should strive to have at least wiggleCount concurrent checkOut calls when checking in / checking out openConnectionsCount connections.

      Expectations:

      • If any action fails, then the test fails.
      • If the test hangs, then it fails. This implies that the test has an adequate timeout.
      • If more than maxConnecting + openConnectionsCount connections are created, then the test fails.
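
      The steps and expectations above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Java driver's test: ToyPool, its stuck gate (playing the role of the fake connection implementation) and run are hypothetical, connections are modelled as Integer ids, and a waiting checkOut in this toy pool is served only by hand-over, which is simpler than a real pool.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

final class HandOverStressSketch {
    /** Hypothetical toy pool; a connection is modelled as an Integer id. */
    static final class ToyPool {
        private final ArrayDeque<Integer> available = new ArrayDeque<>();
        private final ArrayDeque<CompletableFuture<Integer>> waiters = new ArrayDeque<>();
        private final Semaphore connecting;
        private final int maxPoolSize;
        private int total; // guarded by this
        final AtomicInteger created = new AtomicInteger();
        /** Non-null once establishment must hang (the fake connection of step 2). */
        volatile CountDownLatch stuck;

        ToyPool(int maxPoolSize, int maxConnecting) {
            this.maxPoolSize = maxPoolSize;
            this.connecting = new Semaphore(maxConnecting);
        }

        Integer checkOut(long timeoutMs) throws Exception {
            CompletableFuture<Integer> waiter = null;
            synchronized (this) {
                if (!available.isEmpty()) return available.poll();
                if (total < maxPoolSize && connecting.tryAcquire()) total++;
                else waiters.add(waiter = new CompletableFuture<>());
            }
            if (waiter == null) { // this request establishes a new connection
                int id = created.incrementAndGet();
                CountDownLatch gate = stuck;
                if (gate != null) gate.await();
                connecting.release();
                return id;
            }
            try {
                return waiter.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                if (waiter.cancel(false)) throw e;
                return waiter.join(); // completed concurrently with the timeout
            }
        }

        /** Hand-over: the connection goes to the longest-waiting request, if any. */
        synchronized void checkIn(Integer connection) {
            CompletableFuture<Integer> waiter;
            while ((waiter = waiters.poll()) != null) {
                if (waiter.complete(connection)) return; // skips waiters that timed out
            }
            available.add(connection);
        }
    }

    /** Returns the number of connections created; throws if any action fails or hangs. */
    static int run(int openConnectionsCount, int maxConnecting, int wiggleCount) {
        ToyPool pool = new ToyPool(maxConnecting + openConnectionsCount + wiggleCount, maxConnecting);
        ExecutorService executor = Executors.newCachedThreadPool();
        CountDownLatch gate = new CountDownLatch(1);
        try {
            List<Integer> open = new ArrayList<>();
            for (int i = 0; i < openConnectionsCount; i++) open.add(pool.checkOut(1000)); // step 1
            pool.stuck = gate;
            Callable<Integer> checkOut = () -> pool.checkOut(5000);
            for (int i = 0; i < maxConnecting; i++) executor.submit(checkOut); // step 2
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (pool.created.get() < openConnectionsCount + maxConnecting) {
                if (System.nanoTime() > deadline) throw new AssertionError("step 2 hung");
                Thread.sleep(1);
            }
            List<Future<Integer>> checkOuts = new ArrayList<>();
            for (Integer connection : open) { // step 4: concurrent checkOut and checkIn
                checkOuts.add(executor.submit(checkOut));
                executor.submit(() -> pool.checkIn(connection));
            }
            for (Future<Integer> f : checkOuts) f.get(10, TimeUnit.SECONDS); // error or hang fails
            return pool.created.get();
        } catch (Exception e) {
            throw new AssertionError(e);
        } finally {
            gate.countDown(); // release the stuck establishment attempts
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        int open = 3, maxConnecting = 2, wiggle = 2;
        int created = run(open, maxConnecting, wiggle);
        if (created > maxConnecting + open) throw new AssertionError("extra connections: " + created);
    }
}
```

      Because wiggleCount keeps the pool below maxPoolSize during step 4, a pool that ignored the stuck maxConnecting attempts would be free to open a connection; the final check on the created count is what detects that.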

            Assignee: Unassigned
            Reporter: Valentin Kavalenka (valentin.kovalenko@mongodb.com)