[DRIVERS-2218] Clarify intended bounding of implicit session allocation and rework flaky prose test Created: 25/Feb/22  Updated: 25/Jul/22

Status: Implementing
Project: Drivers
Component/s: Sessions
Fix Version/s: None

Type: Spec Change Priority: Unknown
Reporter: Shane Harvey Assignee: Neal Beeken
Resolution: Unresolved Votes: 0
Labels: prose-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Issue split
split to JAVA-4531 Clarify intended bounding of implicit... Closed
split to CDRIVER-4311 Clarify intended bounding of implicit... Closed
split to CSHARP-4097 Clarify intended bounding of implicit... Closed
split to CXX-2465 Clarify intended bounding of implicit... Closed
split to GODRIVER-2332 Clarify intended bounding of implicit... Closed
split to MOTOR-909 Clarify intended bounding of implicit... Closed
split to NODE-4082 Clarify intended bounding of implicit... Closed
split to PHPLIB-811 Clarify intended bounding of implicit... Closed
split to PYTHON-3169 Clarify intended bounding of implicit... Closed
split to RUBY-2924 Clarify intended bounding of implicit... Closed
split to RUST-1218 Clarify intended bounding of implicit... Closed
Problem/Incident
causes DRIVERS-2398 Limit mongos Hosts for Implicit Sessi... Implementing
Related
related to DRIVERS-1030 Drivers should check out an implicit ... Closed
Driver Changes: Needed
Downstream Changes Summary:

Review the updated clarification for spec test 13 and ensure your driver is aligned with the test's assertion requirements. 

Driver Compliance:
Key Status/Resolution FixVersion
CDRIVER-4311 Works as Designed
CXX-2465 Works as Designed
CSHARP-4097 Done 2.16.0
GODRIVER-2332 Duplicate
JAVA-4531 Fixed 4.6.0
NODE-4082 Done 4.6.0
MOTOR-909 Duplicate
PYTHON-3169 Duplicate
PHPLIB-811 Won't Do
RUBY-2924 Fixed 2.18.0
RUST-1218 Duplicate
SWIFT-1516 Won't Fix

 Description   

The prose test added in DRIVERS-1030 is flakey:

13. To confirm that implicit sessions only allocate their server session after a successful connection checkout
 
    * Create a MongoClient with the following options: ``maxPoolSize=1`` and ``retryWrites=true``
    * Attach a command started listener that collects each command's lsid
    * Initiate the following concurrent operations
      * insertOne
      * deleteOne
      * updateOne
      * bulkWrite ``[ { updateOne } ]``
      * findOneAndDelete
      * findOneAndUpdate
      * findOneAndReplace
      * find
    * Wait for all operations to complete
    * Assert that all commands contain the same lsid

The issue is that the session is checked back into the pool after the connection is checked back in. So there's always the risk that this test uses >1 server session. The only way to guarantee this test is to also optimize implicit session deallocation as well. However, that is potentially trickier since we would need to ensure that retryable reads/writes, cursors, and other pinned operations don't deallocate the session too early.

Acceptance Criteria

  • Change server session pooling to be required by the specification - Some quick analysis implies this would be an easy and no-op change
    • If there's contention on this change file a follow up DRIVERS ticket to address this, and make the limiting of session allocation optional for DRIVERS without a session pool
  • Explicitly define that sessions should be acquired at connection checkout and SHOULD NOT (but may in extremely rare circumstances) be released before checkIn
  • Explicitly state that sessions MUST be released and reused to meet the goal of reducing session allocation to be within the bounds of [1, numConcurrentOperationCount)
  • Update test 13 to account for the variability in implementation details (or split out the following into a further test case)
    • Drivers MAY assert that exactly one session is used for all the concurrent operations listed in the test, however this is racey if the session isn't released before checkIn (which SHOULD NOT be attempted)
    • Drivers SHOULD assert that after repeated runs they are able to achieve the use of exactly one session, this will statistically prove we've reduced the allocation amount
    • Drivers MUST assert that the number of allocated sessions never exceeds the number of concurrent operations executing


 Comments   
Comment by Githook User [ 24/Mar/22 ]

Author:

{'name': 'Neal Beeken', 'email': 'neal.beeken@mongodb.com', 'username': 'nbbeeken'}

Message: DRIVERS-2218: Clarify intended bounding of implicit session allocation and rework flaky prose test (#1157)
Branch: master
https://github.com/mongodb/specifications/commit/4a9defdbb1952f794042673360b7b3d0b568e7c3

Comment by Neal Beeken [ 02/Mar/22 ]

I still think there's no issue and test value with the prose test as it is the test does confirm what is asked: "This limits the number of implicit sessions to never exceed the maximum connection pool size."
(I agree, this sentence could use more context, it's true this sentence was never fully correct because as part of the original ticket/ask there was to be no change to cursor-session lifetime and explicit sessions)

If we want to allow drivers to adopt the current requirement of ensuring the order of operations is always: checkout, acquire (operation), release, checkin, then that is what asserting the same LSID for every operation in a pool size of 1 should accomplish. Although there is a gap, where it says you should run a find, but doesn't clarify that it should be a small query. I haven't seen much precedent for that in other tests, but nonetheless should be called out. 

To Daria's summary (thanks for that!) I think #5 is what we really want to get at, we need to at least assert that drivers have changed their session acquire logic to permit operations to run instead of running into a TooManyLogicalSessions error before proceeding. Unless you are suggesting we should revisit the spec to explicitly disallow attempting to release the session before checkin completes? We could consider that to make this change more uniform. (In reference to Shane's statement "I would advise drivers not try to optimize implicit session deallocation")

Comment by Shane Harvey [ 02/Mar/22 ]

Sure we can do that in this ticket as long as we also fix the issue in the prose test.

Comment by Daria Pardue [ 02/Mar/22 ]

shane.harvey All the more reason to update the language. Would you be opposed to repurposing this ticket for that update to the language? We can then send it to leads triage since we probably want the clarification sooner rather than later, given the teams are in the process of implementing this.

Comment by Shane Harvey [ 01/Mar/22 ]

Yeah that's true. It's also worth mentioning that sentence is incorrect regardless since there are other factors that will create more than maxPoolSize sessions, multi-batch cursors for example. Even ignoring cursors, clients that run operations on multiple nodes (eg sharded clusters or non-primary read preferences) can create maxPoolSize * # hosts connections.

Comment by Shane Harvey [ 01/Mar/22 ]

Great summary daria.pardue. However, I would advise drivers not try to optimize implicit session deallocation without a benchmark showing that such behavior is required. I believe it would be too risky given the complexity around retryable reads/writes, cursors, and other pinned operations. My understanding is that simply delaying the allocation provides nearly all of the benefit we need to solve the original issue in DRIVERS-1030 and the test we added is just a bit too strict. I would like to see confirmation via a performance test or repro for DRIVERS-1030 with an updated driver to confirm this belief before we require (or even suggest) optimizing implicit session deallocation.

Comment by Shane Harvey [ 28/Feb/22 ]

One solution for this might be to acknowledge the race in the test description and update the test to be more like:

  • Run 100 concurrent operations (a mix of the different driver helpers)
  • Wait for the operations to complete
  • Assert no more than X different server sessions were used. 5 could be a suitable value for X but it can vary by driver. It should be chosen to accommodate the race described above.

Another solution might be to repeat the existing test X times until the assertion that "all commands contain the same lsid" holds true.

Comment by Shane Harvey [ 28/Feb/22 ]

The test needs to be concurrent because DRIVERS-1030 helps solve the problem of too many sessions being created for concurrent operations. As for the race, consider:

  1. Thread A runs find(), checks out a connection, checks out the first server session, runs the command, and checks the connection back into the pool, and then Thread A gets suspended.
  2. Thread B runs insertOne(), checks out the existing pooled connection, checks out a second server session, completes the insert, and checks the connection and session back into the pool.
  3. Thread A resumes, sees the cursor.id is 0 and checks the session back into the pool.
  4. 2 server sessions we're used, not 1 as the test expects.
Comment by Robert Stam [ 28/Feb/22 ]

Why does this test have to be concurrent? Wouldn't testing the operations sequentially (which is what I first assumed it would be doing until I read more carefully) be sufficient to test that all operations use the same `lsid`?

Comment by Robert Stam [ 28/Feb/22 ]

I just realized that the test says the operations should be concurrent.

Are you referring to a concurrency race condition?

Comment by Robert Stam [ 28/Feb/22 ]

> The issue is that the session is checked back into the pool after the connection is checked back in

I don't understand the issue. Yes the session might be checked back in after the connection, but as long as it is checked in before the next implicit session is created we should be fine.

Generated at Thu Feb 08 08:25:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.