[DRIVERS-2816] Work around changes introduced in SERVER-82353 Created: 29/Jan/24 Updated: 01/Feb/24 |
|
| Status: | Needs Triage |
| Project: | Drivers |
| Component/s: | Unified Test Runner |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Unknown |
| Reporter: | Jeffrey Yemin | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||
| Driver Changes: | Needed | ||||||||||||||||||||
| Description |
|
The driver specification repository has a unified (i.e. implemented by all drivers) specification test (specified in YAML) which asserts that a ClientSession (an abstraction in drivers that wraps a server session) unpins from a mongos server after an aborted transaction. The test is "unpin after TransientTransactionError error on commit" in transactions/tests/unified/mongos-unpin.yml. The test started failing on January 24 on multiple drivers with an error like this:
In HELP-54982, it was determined to be caused by changed introduce in The decision in the HELP ticket is to work around it in the unified test runner. Several solutions were attempted, and none were found to be satisfactory. The underlying issue that needs to be worked around is that the test set up creates a new collection via one mongos server, and the test body inserts documents into the new collection on a (potentially) different mongos. Moreover, it uses a dedicated MongoClient for the former, and a test-defined MongoClient for the latter, so there is no causal relationship established between the two via cluster time gossiping. So the workaround has to involve gossiping the cluster time between MongoClient instances. This can be done via ClientSession#advanceClusterTime. The question is how to do it without breaking the tests in other ways. Things that have been tried:
One other idea that hasn't been tried yet:
|
| Comments |
| Comment by Jeffrey Yemin [ 30/Jan/24 ] |
AFAIK it is the only test currently affected, but it's not clear to me why some of the other tests in the same file are not also affected. It's possible they are passing due to cluster time gossiping side effects from previous tests.
That's a good observation and one I think I overlooked when looking for flaws in the original approach that I tried (which did work). So maybe this really is a simple fix.
This might be a short cut taken by the Java driver that currently always works, since each MongoClient should be unused by anything else until test operations start. It didn't seem trivial to work around it at first glance, but it's feasible.
It has to be done to create a causal link between the collection creation and the insert into that collection, so it requires getting the cluster time from after the creation occurs. To make this one test pass, it would only have to be done once, but it has to be at the right time. |
| Comment by Jeremy Mikola [ 30/Jan/24 ] |
|
I read through HELP-54982, but it's unclear to me if mongos-unpin: unpin after TransientTransactionError error on commit is the only test affected by If so, I think we can explore splitting out that test to its own file before attempting to alter the test runner's core logic. For instance, we can defer collection creation to the test itself (out of initialData) to ensure a common MongoClient is used for the operation. I do have some questions about text in the description above.
If the test failure here pertains to transactions, what is the relevance to "future tests that don't use session entities"? Wouldn't any transaction test utilize at least one session entity? Also, if we presently have only one test failing due to
If the ping was considered part of a setup routine for the client entity, I see no reason for it to impact expectEvents, which denotes events "expected to be observed while executing the test's operations." That said, the spec is vague on this and never had any reason to go into significant detail, so we'd need to clarify this so all drivers know to be consistent about only observing events for execution of tests[].operations.
This seems like the most non-invasive approach, as it's just isolated overhead for the test runner – similar to what we do by running distinct commands before any test that might execute distinct within a sharded transaction. And correct me if I'm wrong, but if the goal here is to gossip a cluster time across the entire mongos cluster, then I expect this is something we'd only want to do once for the entire test suite (or, more tactically, before the first test that might be affected by |