- Type: Spec Change
- Resolution: Unresolved
- Priority: Unknown
- Component/s: CMAP
Summary
The existing specification tests do not adequately test the Connection Pool under intensive concurrent usage. Not only do they use the Connection Pool lightly and not necessarily concurrently, they also try to reproduce the same predictable execution every time in order to observe a specific set of events in a specific order. Concurrent stress tests take a different approach: they subject the object under test to intensive, variable usage in order to produce many different execution paths, including otherwise rare ones. This is possible because the assertions in concurrent stress tests are simpler than those in the usual tests: they typically expect only the absence of unexpected behaviors, e.g., exceptions or deadlocks, and possibly compliance with very basic specified guarantees, like not returning null values from methods that must not return null.
Motivation
Who is the affected end user?
Driver engineers.
How does this affect the end user?
Having such tests increases the probability of a driver engineer discovering a bug in the Connection Pool before releasing the changes to driver users.
How likely is it that this problem or use case will occur?
It depends on the complexity of the changes to the Connection Pool. While working on "Avoiding connection storms" (JAVA-3890), the concurrent tests allowed me to discover and fix multiple bugs that were not caught by the existing specification tests. boris.dogadov reported a similar experience.
If the problem does occur, what are the consequences and how severe are they?
If we miss a concurrency bug in the Connection Pool, it may cause serious problems in the application that uses the driver, e.g., deadlocks or memory leaks. It is impossible to be more specific.
Is this issue urgent?
No.
Is this ticket required by a downstream team?
No.
Is this ticket only for tests?
Yes.
Details
Below is an overview of what the concurrent stress tests in the MongoDB Java driver do, limited to those tests that are not specific to that driver. I think that the description of these tests should be quite vague, allowing individual drivers to create tests that are more appropriate for them and have a higher chance of discovering bugs. Even implementations of simpler specification prose tests sometimes differ from the prose description, at least in the Java driver.
Concurrent usage stress test.
Create a pool with various values of minPoolSize/maxPoolSize, maxIdleTimeMS, and any other non-standard options that your pool may have. Using extreme option values may be helpful. Concurrently use the pool (checkOut/checkIn, synchronously and/or asynchronously) with different numbers of concurrent users. While the pool is in use, spontaneously invalidate it (clear followed by ready), spontaneously clear it, or spontaneously mark it ready; vary the probability of such disturbances between executions.
Expectations:
- If any action fails, including failing with a timeout but excluding PoolClearedError, then the test fails.
- If the test hangs, then it fails. This implies that the test has an adequate timeout.
I cannot stress enough how helpful it is to have assertions (checks of expectations that may be violated if and only if the driver code is incorrect) in the Connection Pool code itself.
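The concurrent usage stress test described above could be sketched roughly as follows. This is a minimal illustration and not the Java driver's actual test: `ToyPool`, `PoolClearedException`, and `run` are hypothetical names invented for this example, and the toy pool models only a bounded number of checked-out connections plus a ready/cleared flag. It also contains one in-pool assertion of the kind mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

final class PoolStressSketch {
    /** Hypothetical stand-in for the driver's PoolClearedError. */
    static final class PoolClearedException extends RuntimeException { }

    /** Hypothetical toy pool: a semaphore bounds checked-out connections. */
    static final class ToyPool {
        private final int maxPoolSize;
        private final Semaphore permits;
        private final AtomicInteger checkedOut = new AtomicInteger();
        private volatile boolean ready = true;

        ToyPool(int maxPoolSize) {
            this.maxPoolSize = maxPoolSize;
            this.permits = new Semaphore(maxPoolSize, true);
        }

        Object checkOut(long timeoutMs) throws InterruptedException, TimeoutException {
            if (!ready) throw new PoolClearedException();
            if (!permits.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS)) {
                throw new TimeoutException("checkOut timed out");
            }
            // In-pool assertion: may be violated only if the pool code is incorrect.
            if (checkedOut.incrementAndGet() > maxPoolSize) {
                throw new AssertionError("more connections checked out than maxPoolSize");
            }
            return new Object();
        }

        void checkIn(Object connection) {
            checkedOut.decrementAndGet();
            permits.release();
        }

        void clear() { ready = false; }
        void ready() { ready = true; }
    }

    /** Returns the number of unexpected failures; throws AssertionError if the run hangs. */
    static int run(int maxPoolSize, int users, int opsPerUser, double disturbanceProbability) {
        ToyPool pool = new ToyPool(maxPoolSize);
        AtomicInteger unexpected = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(users);
        Runnable user = () -> {
            ThreadLocalRandom random = ThreadLocalRandom.current();
            for (int i = 0; i < opsPerUser; i++) {
                try {
                    if (random.nextDouble() < disturbanceProbability) {
                        pool.clear(); // spontaneous invalidation: clear followed by ready
                        pool.ready();
                    }
                    pool.checkIn(pool.checkOut(5_000));
                } catch (PoolClearedException tolerated) {
                    // the only failure the test tolerates
                } catch (Throwable t) {
                    unexpected.incrementAndGet(); // includes timeouts
                }
            }
        };
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int u = 0; u < users; u++) futures.add(executor.submit(user));
            for (Future<?> future : futures) future.get(30, TimeUnit.SECONDS); // guards against hangs
        } catch (Exception e) {
            throw new AssertionError("the test hung or was interrupted", e);
        } finally {
            executor.shutdownNow();
        }
        return unexpected.get();
    }

    public static void main(String[] args) {
        System.out.println("unexpected failures: " + run(2, 8, 1_000, 0.05));
    }
}
```

A real test would repeat `run` over many combinations of pool options, user counts, and disturbance probabilities, and fail if the returned count is non-zero.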
Hand-over mechanism concurrent stress test.
While this test may not be relevant to some drivers, I know that it is relevant to some others besides the Java driver. The specification states "the Pool MUST NOT service any newer checkOut requests before fulfilling the original one which could not be fulfilled". This fairness requirement means that a checked-in connection must become available only to the checkOut request that has been waiting longer than the others. I refer to the mechanism that achieves such behavior as the hand-over mechanism. In some drivers, including the Java driver, it adds enough complexity to warrant separate testing.
Create a pool whose connections do not expire and that has no background thread populating it. The maxPoolSize must be equal to maxConnecting + openConnectionsCount + wiggleCount. The meaning of the last two terms will soon become clear.
- Check out openConnectionsCount connections.
- Initiate checkOut of maxConnecting connections and ensure that they are stuck indefinitely trying to be established. The Java driver does this by using a fake connection implementation.
- As a result of the stuck connections, no connection may be established, and the only way to check out a connection is to have it checked in before checking out.
- Start checking in the openConnectionsCount connections that were previously checked out. Concurrently start checking out openConnectionsCount connections.
- If the hand-over mechanism works, the test will be able to check out openConnectionsCount connections without the pool creating any connections.
- wiggleCount leaves room under maxPoolSize for new connections to be created, so that the test can then check that no connections were created nonetheless.
- In order to make the most of wiggleCount, the test should strive to have at least wiggleCount concurrent checkOut calls when checking in / checking out openConnectionsCount connections.
Expectations:
- If any action fails, then the test fails.
- If the test hangs, then it fails. This implies that the test has an adequate timeout.
- If more than maxConnecting + openConnectionsCount connections are created, then the test fails.
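The hand-over test steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `ToyPool`, `Establisher`, and `run` are hypothetical names, the toy pool uses a plain blocking queue as a simplistic stand-in for a real hand-over mechanism (it does not guarantee strict fairness), and a connection counts as created as soon as its establishment starts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class HandOverStressSketch {
    /** Hypothetical hook that lets the test make connection establishment hang. */
    interface Establisher { void establish() throws InterruptedException; }

    /** Hypothetical toy pool; the queue stands in for the hand-over mechanism. */
    static final class ToyPool {
        final AtomicInteger created = new AtomicInteger();
        private final BlockingQueue<Object> handOver = new LinkedBlockingQueue<>();
        private final Semaphore connecting;
        private final int maxPoolSize;
        private final Establisher establisher;

        ToyPool(int maxPoolSize, int maxConnecting, Establisher establisher) {
            this.maxPoolSize = maxPoolSize;
            this.connecting = new Semaphore(maxConnecting);
            this.establisher = establisher;
        }

        Object checkOut(long timeoutMs) throws InterruptedException, TimeoutException {
            Object connection = handOver.poll();
            if (connection != null) return connection;
            if (connecting.tryAcquire()) { // respects maxConnecting
                try {
                    if (created.incrementAndGet() <= maxPoolSize) {
                        establisher.establish(); // hangs forever once the test says so
                        return new Object();
                    }
                    created.decrementAndGet(); // pool is full, fall through and wait
                } finally {
                    connecting.release();
                }
            }
            connection = handOver.poll(timeoutMs, TimeUnit.MILLISECONDS); // wait for a check-in
            if (connection == null) throw new TimeoutException("checkOut timed out");
            return connection;
        }

        void checkIn(Object connection) { handOver.add(connection); }
    }

    /** Returns {connections obtained via hand-over, connections created}. */
    static int[] run(int openConnectionsCount, int maxConnecting, int wiggleCount) {
        int maxPoolSize = maxConnecting + openConnectionsCount + wiggleCount;
        AtomicBoolean hang = new AtomicBoolean(false);
        CountDownLatch stuck = new CountDownLatch(maxConnecting);
        ToyPool pool = new ToyPool(maxPoolSize, maxConnecting, () -> {
            if (hang.get()) {
                stuck.countDown();
                new CountDownLatch(1).await(); // never counted down: stuck until interrupted
            }
        });
        ExecutorService executor = Executors.newCachedThreadPool();
        try {
            List<Object> open = new ArrayList<>();
            for (int i = 0; i < openConnectionsCount; i++) open.add(pool.checkOut(1_000));
            hang.set(true);
            Callable<Object> stuckCheckOut = () -> pool.checkOut(60_000);
            for (int i = 0; i < maxConnecting; i++) executor.submit(stuckCheckOut);
            if (!stuck.await(10, TimeUnit.SECONDS)) throw new AssertionError("connections are not stuck");
            Callable<Object> waitingCheckOut = () -> pool.checkOut(10_000);
            List<Future<Object>> checkOuts = new ArrayList<>();
            for (int i = 0; i < openConnectionsCount; i++) checkOuts.add(executor.submit(waitingCheckOut));
            for (Object connection : open) pool.checkIn(connection); // races with the checkOuts above
            int handedOver = 0;
            for (Future<Object> checkOut : checkOuts) {
                checkOut.get(30, TimeUnit.SECONDS); // a failure or a hang ends up in the catch below
                handedOver++;
            }
            return new int[] {handedOver, pool.created.get()};
        } catch (Exception e) {
            throw new AssertionError("an action failed or the test hung", e);
        } finally {
            executor.shutdownNow(); // interrupts the stuck establishments
        }
    }
}
```

With this sketch, a call such as `HandOverStressSketch.run(4, 2, 3)` is expected to report 4 handed-over connections and 4 + 2 created connections; a larger created count would mean the pool opened connections it should not have.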
Split to
- CXX-2603 Introduce prose concurrent stress tests for the Connection Pool (Blocked)
- CDRIVER-4509 Introduce prose concurrent stress tests for the Connection Pool (Blocked)
- CSHARP-4372 Introduce prose concurrent stress tests for the Connection Pool (Blocked)
- GODRIVER-2593 Introduce prose concurrent stress tests for the Connection Pool (Blocked)
- JAVA-4786 Introduce prose concurrent stress tests for the Connection Pool (Blocked)
- MOTOR-1052 Introduce prose concurrent stress tests for the Connection Pool (Blocked)
- NODE-4740 Introduce prose concurrent stress tests for the Connection Pool (Blocked)
- PYTHON-3483 Introduce prose concurrent stress tests for the Connection Pool (Blocked)
- RUBY-3162 Introduce prose concurrent stress tests for the Connection Pool (Blocked)
- RUST-1514 Introduce prose concurrent stress tests for the Connection Pool (Blocked)
- PHPLIB-1026 Introduce prose concurrent stress tests for the Connection Pool (Closed)