[DRIVERS-2004] Introduce prose concurrent stress tests for the Connection Pool Created: 03/Dec/21  Updated: 03/Oct/23

Status: Backlog
Project: Drivers
Component/s: CMAP
Fix Version/s: None

Type: Spec Change Priority: Unknown
Reporter: Valentin Kavalenka Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Issue split
split to CXX-2603 Introduce prose concurrent stress tes... Blocked
split to CDRIVER-4509 Introduce prose concurrent stress tes... Blocked
split to CSHARP-4372 Introduce prose concurrent stress tes... Blocked
split to GODRIVER-2593 Introduce prose concurrent stress tes... Blocked
split to JAVA-4786 Introduce prose concurrent stress tes... Blocked
split to MOTOR-1052 Introduce prose concurrent stress tes... Blocked
split to NODE-4740 Introduce prose concurrent stress tes... Blocked
split to PYTHON-3483 Introduce prose concurrent stress tes... Blocked
split to RUBY-3162 Introduce prose concurrent stress tes... Blocked
split to RUST-1514 Introduce prose concurrent stress tes... Blocked
split to PHPLIB-1026 Introduce prose concurrent stress tes... Closed
Driver Changes: Needed
Driver Compliance:
Key Status/Resolution FixVersion
CDRIVER-4509 Blocked
CXX-2603 Blocked
CSHARP-4372 Blocked
GODRIVER-2593 Blocked
JAVA-4786 Blocked
NODE-4740 Blocked
MOTOR-1052 Blocked
PYTHON-3483 Blocked
PHPLIB-1026 Won't Do
RUBY-3162 Blocked
RUST-1514 Blocked
SWIFT-1664 Won't Do

 Description   

Summary

The existing specification tests do not adequately test the Connection Pool under intensive concurrent usage: not only do they use the Connection Pool lightly and not necessarily concurrently, they also try to achieve the same predictable execution in order to observe a specific set of events in a specific order. Concurrent stress tests use a different approach: they subject the object under test to intensive, variable usage in order to produce many different execution paths, including otherwise rare ones. This is possible because the assertions in concurrent stress tests are simpler than those in usual tests: they usually expect the absence of unexpected behaviors, e.g., exceptions or deadlocks, and potentially compliance with very basic specified guarantees, like not returning null values from non-null methods.

Motivation

Who is the affected end user?

Driver engineers.

How does this affect the end user?

Having such tests increases the probability of a driver engineer discovering a bug in the Connection Pool before releasing the changes to driver users.

How likely is it that this problem or use case will occur?

It depends on the complexity of the changes to the Connection Pool. While working on "Avoiding connection storms" (JAVA-3890), the concurrent tests allowed me to discover and fix multiple bugs that were not caught by the existing specification tests. boris.dogadov reported a similar experience.

If the problem does occur, what are the consequences and how severe are they?

If we miss a concurrency bug in the Connection Pool, it may cause serious problems in the application that uses the driver, e.g., deadlocks or memory leaks. It is impossible to be more concrete than that.

Is this issue urgent?

No.

Is this ticket required by a downstream team?

No.

Is this ticket only for tests?

Yes.

Details

Below is an overview of what the Java MongoDB driver's concurrent stress tests do, limited to the parts that are not specific to that driver. I think that the description of these tests should be quite vague, allowing individual drivers to create tests that are more appropriate for them and have higher chances of discovering bugs. Even implementations of simpler specification prose tests sometimes differ from the prose description, at least in the Java driver.

Concurrent usage stress test.

Create a pool with various minPoolSize/maxPoolSize, maxIdleTimeMS, and any other non-standard options that your pool may have. Utilizing extreme values of options may be helpful. Concurrently use the pool (checkOut/checkIn synchronously/asynchronously) with different numbers of concurrent users. While the pool is in use, spontaneously invalidate it (clear followed by ready), spontaneously clear it, or spontaneously mark it ready; vary the probability of such disturbances in different executions.

Expectations:

  • If any action fails, including failing with a timeout but excluding PoolClearedError, then the test fails.
  • If the test hangs, then it fails. This implies that the test has an adequate timeout.
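
The shape of such a test can be illustrated with a minimal Python sketch. It is not taken from any driver: `ToyPool`, `stress`, and all parameter names are hypothetical, and the toy pool only counts checked-out connections so that the example stays self-contained. A real test would run the same loop against the driver's actual pool.

```python
import random
import threading
import time


class PoolClearedError(Exception):
    """Raised by check_out while the toy pool is cleared (paused)."""


class ToyPool:
    """Hypothetical stand-in for a driver's pool; not a real driver API."""

    def __init__(self, max_pool_size):
        self._cond = threading.Condition()
        self._max_pool_size = max_pool_size
        self._is_ready = True
        self.checked_out = 0

    def check_out(self, timeout):
        with self._cond:
            def settled():
                return not self._is_ready or self.checked_out < self._max_pool_size
            if not self._cond.wait_for(settled, timeout):
                raise TimeoutError("check_out timed out")
            if not self._is_ready:
                raise PoolClearedError()
            self.checked_out += 1
            return object()

    def check_in(self, conn):
        with self._cond:
            self.checked_out -= 1
            self._cond.notify_all()

    def clear(self):
        with self._cond:
            self._is_ready = False
            self._cond.notify_all()  # wake waiters so that they fail fast

    def ready(self):
        with self._cond:
            self._is_ready = True
            self._cond.notify_all()


def stress(max_pool_size, users, iterations, disturbance_probability, seed,
           test_timeout=30):
    pool = ToyPool(max_pool_size)
    failures = []

    def invalidate():
        pool.clear()
        pool.ready()

    def user(user_id):
        rng = random.Random(seed * 1000 + user_id)
        for _ in range(iterations):
            try:
                if rng.random() < disturbance_probability:
                    rng.choice([pool.clear, pool.ready, invalidate])()
                conn = pool.check_out(timeout=5)
                time.sleep(0)  # yield to other threads while holding the connection
                pool.check_in(conn)
            except PoolClearedError:
                pass  # the only failure the test tolerates
            except Exception as error:  # includes timeouts
                failures.append(error)

    threads = [threading.Thread(target=user, args=(i,), daemon=True)
               for i in range(users)]
    for thread in threads:
        thread.start()
    deadline = time.monotonic() + test_timeout
    for thread in threads:
        thread.join(max(0.0, deadline - time.monotonic()))
    assert not any(t.is_alive() for t in threads), "test hung"
    assert not failures, failures
    return pool.checked_out  # 0 when every checked-out connection was checked in
```

Running `stress` over a grid of pool sizes, user counts, and disturbance probabilities (including extreme values such as a pool of size 1) is the part that produces varied execution paths; the two assertions at the end correspond to the two expectations above.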

I cannot stress enough how helpful it is to have assertions (checks of expectations that may be violated if and only if the driver code is incorrect) in the Connection Pool code itself.

Hand-over mechanism concurrent stress test.

While this test may not be relevant to some drivers, I know that it is relevant to some others besides the Java driver. The specification states "the Pool MUST NOT service any newer checkOut requests before fulfilling the original one which could not be fulfilled". This fairness requirement means that a checked-in connection must become available only to the checkOut request that has been waiting the longest. I refer to the mechanism that achieves such behavior as the hand-over mechanism. In some drivers, including the Java driver, it adds enough complexity to warrant additional testing.
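
To make the fairness requirement concrete, here is a minimal Python sketch of one way a hand-over could work. It is an assumption made for illustration, not how the Java driver or any other driver implements it: `HandOverPool` and `demo_fifo_order` are hypothetical names, and the pool holds a fixed set of connections and never creates new ones.

```python
import threading
import time
from collections import deque


class HandOverPool:
    """Toy pool with FIFO hand-over; hypothetical, not a real driver API."""

    def __init__(self, connections):
        self._lock = threading.Lock()
        self._idle = deque(connections)
        self._waiters = deque()  # oldest waiting request is at the left

    def check_out(self, timeout=None):
        with self._lock:
            if self._idle and not self._waiters:
                return self._idle.popleft()
            slot = {"event": threading.Event(), "conn": None}
            self._waiters.append(slot)
        if not slot["event"].wait(timeout):
            with self._lock:
                if slot["conn"] is None:  # not handed over in the meantime
                    self._waiters.remove(slot)
                    raise TimeoutError("check_out timed out")
        return slot["conn"]

    def check_in(self, conn):
        with self._lock:
            if self._waiters:
                # Hand the connection directly to the oldest waiter, so a
                # newer check_out request can never take it first.
                slot = self._waiters.popleft()
                slot["conn"] = conn
                slot["event"].set()
            else:
                self._idle.append(conn)


def demo_fifo_order():
    pool = HandOverPool(["c1"])
    held = pool.check_out()
    order = []

    def waiter(name):
        conn = pool.check_out(timeout=5)
        order.append(name)
        pool.check_in(conn)

    threads = []
    for index, name in enumerate("ABC"):
        thread = threading.Thread(target=waiter, args=(name,))
        thread.start()
        threads.append(thread)
        while len(pool._waiters) < index + 1:  # wait until this request is queued
            time.sleep(0.001)
    pool.check_in(held)
    for thread in threads:
        thread.join(5)
    return order
```

In `demo_fifo_order`, the three requests are queued in the order A, B, C while the only connection is held, so the returned order shows which request was served first after the connection is checked in.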

Create a pool with connections not expiring and no background thread populating it. The maxPoolSize must be equal to maxConnecting + openConnectionsCount + wiggleCount. The meaning of the last two terms will soon become clear.

  1. Check out openConnectionsCount connections.
  2. Initiate checkOut of maxConnecting connections and ensure that they are stuck indefinitely trying to be established. The Java driver does this by using a fake connection implementation.
  3. As a result of the stuck connections, no new connection may be established, and the only way to check out a connection is to have one checked in first.
  4. Start checking in the openConnectionsCount connections that were previously checked out. Concurrently start checking out openConnectionsCount connections.
  5. If the hand-over mechanism works, the test will be able to successfully check out openConnectionsCount connections without the pool creating any connections.
  6. wiggleCount is needed to open opportunities to create new connections and then check that no connections were created nonetheless.
  7. In order to make the most of wiggleCount, the test should strive to have at least wiggleCount concurrent checkOut calls when checking in / checking out openConnectionsCount connections.

Expectations:

  • If any action fails, then the test fails.
  • If the test hangs, then it fails. This implies that the test has an adequate timeout.
  • If more than maxConnecting + openConnectionsCount connections are created, then the test fails.
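
The scenario above can be sketched in Python as follows. Everything here is a hypothetical illustration rather than the Java driver's test: `ToyPool` is a simplified fair pool with a maxConnecting limit, `establish` plays the role of the fake connection implementation, and `created` is assumed to count connections whose establishment has started, whether or not it has finished.

```python
import threading
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class ToyPool:
    """Hypothetical fair pool with a maxConnecting limit; not a real driver API."""

    def __init__(self, max_pool_size, max_connecting, establish):
        self._cond = threading.Condition()
        self._max_pool_size = max_pool_size
        self._max_connecting = max_connecting
        self._establish = establish
        self._idle = deque()
        self._queue = deque()  # FIFO of pending check_out requests
        self.connecting = 0    # connections currently being established
        self.created = 0       # connections created, established or not

    def check_out(self, timeout=None):
        me = object()
        with self._cond:
            self._queue.append(me)

            def can_proceed():
                can_create = (self.created < self._max_pool_size
                              and self.connecting < self._max_connecting)
                return self._queue[0] is me and (bool(self._idle) or can_create)

            if not self._cond.wait_for(can_proceed, timeout):
                self._queue.remove(me)
                self._cond.notify_all()
                raise TimeoutError("check_out timed out")
            self._queue.popleft()
            self._cond.notify_all()  # let the next request in line re-evaluate
            if self._idle:
                return self._idle.popleft()  # hand-over: the oldest request wins
            self.created += 1
            self.connecting += 1
        conn = self._establish()  # outside the lock; may block forever in the test
        with self._cond:
            self.connecting -= 1
            self._cond.notify_all()
        return conn

    def check_in(self, conn):
        with self._cond:
            self._idle.append(conn)
            self._cond.notify_all()


def hand_over_stress(open_count=8, max_connecting=2, wiggle_count=3):
    block = threading.Event()  # once set, new establishments get stuck
    never = threading.Event()  # never set

    def establish():
        if block.is_set():
            never.wait()  # simulates a connection that never gets established
        return object()

    pool = ToyPool(max_connecting + open_count + wiggle_count,
                   max_connecting, establish)
    held = [pool.check_out(timeout=5) for _ in range(open_count)]  # step 1
    block.set()
    for _ in range(max_connecting):  # step 2
        threading.Thread(target=pool.check_out, daemon=True).start()
    deadline = time.monotonic() + 5
    while pool.connecting < max_connecting:
        assert time.monotonic() < deadline, "establishments did not get stuck"
        time.sleep(0.001)
    with ThreadPoolExecutor(max_workers=2 * open_count) as executor:  # step 4
        check_outs = []
        for conn in held:
            check_outs.append(executor.submit(pool.check_out, 5))
            executor.submit(pool.check_in, conn)
        obtained = [future.result(timeout=10) for future in check_outs]
    # Step 5: only previously opened connections were handed over.
    assert {id(c) for c in obtained} == {id(c) for c in held}
    assert pool.created == max_connecting + open_count
    return pool.created
```

Because the pool is sized with `wiggle_count` spare slots, a pool that ignored the maxConnecting limit or bypassed the queue could create extra connections during step 4, which the final assertion on `created` would detect. The timeouts on `check_out` and `future.result` stand in for the overall test timeout.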


 Comments   
Comment by Boris Dogadov [ 11/Oct/22 ]

patrick.freed@mongodb.com and I think that this test provides very valuable coverage for any CMAP-related changes, and we suggest considering it for the upcoming quarter.
It has been paying dividends in .NET; specifically, some fundamental bugs were caught when implementing DRIVERS-1707.

Comment by Patrick Freed [ 14/Dec/21 ]

These tests seem like they would be very valuable for all drivers to implement. Putting this in the backlog for now; we will revisit before the next quarterly planning.

Generated at Thu Feb 08 08:24:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.