[DRIVERS-2816] Work around changes introduced in SERVER-82353 Created: 29/Jan/24  Updated: 01/Feb/24

Status: Needs Triage
Project: Drivers
Component/s: Unified Test Runner
Fix Version/s: None

Type: Improvement Priority: Unknown
Reporter: Jeffrey Yemin Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by NODE-5855 Unskip Transactions Spec Unified Test... Needs Triage
is depended on by PYTHON-4182 [Build Failure] test_unpin_after_Tran... In Code Review
Related
related to GODRIVER-3113 [Build Failure] unpin_after_Transient... In Code Review
Driver Changes: Needed

 Description   

The driver specification repository has a unified (i.e. implemented by all drivers) specification test (specified in YAML) which asserts that a ClientSession (an abstraction in drivers that wraps a server session) unpins from a mongos server after an aborted transaction. The test is "unpin after TransientTransactionError error on commit" in transactions/tests/unified/mongos-unpin.yml. The test started failing on January 24 on multiple drivers with an error like this:

Command failed with error 272 (MigrationConflict): 'Transaction 68a39c33-df8c-473d-9ffc-f760c59d170d - 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU= -  - :1 was aborted on statement 0 due to: a non-retryable snapshot error :: caused by :: Encountered error from localhost:27219 during a transaction :: caused by :: Database mongos-unpin-db has undergone a catalog change operation at time Timestamp(1706131985, 34) and no longer satisfies the requirements for the current transaction which requires Timestamp(1706131981, 1). Transaction will be aborted.' on server localhost:27018. The full response is {"ok": 0.0, "errmsg": "Transaction 68a39c33-df8c-473d-9ffc-f760c59d170d - 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU= -  - :1 was aborted on statement 0 due to: a non-retryable snapshot error :: caused by :: Encountered error from localhost:27219 during a transaction :: caused by :: Database mongos-unpin-db has undergone a catalog change operation at time Timestamp(1706131985, 34) and no longer satisfies the requirements for the current transaction which requires Timestamp(1706131981, 1). Transaction will be aborted.", "code": 272, "codeName": "MigrationConflict", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1706131985, "i": 94}}, "signature": {"hash": {"$binary": {"base64": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=", "subType": "00"}}, "keyId": 0}}, "operationTime": {"$timestamp": {"t": 1706131985, "i": 94}}, "errorLabels": ["TransientTransactionError"]

In HELP-54982, it was determined to be caused by changed introduce in SERVER-82353, an issue that fixes a data-loss bug and should not be reverted.

The decision in the HELP ticket is to work around it in the unified test runner. Several solutions were attempted, and none were found to be satisfactory.

The underlying issue that needs to be worked around is that the test set up creates a new collection via one mongos server, and the test body inserts documents into the new collection on a (potentially) different mongos. Moreover, it uses a dedicated MongoClient for the former, and a test-defined MongoClient for the latter, so there is no causal relationship established between the two via cluster time gossiping. So the workaround has to involve gossiping the cluster time between MongoClient instances. This can be done via ClientSession#advanceClusterTime. The question is how to do it without breaking the tests in other ways. Things that have been tried:

  • Advance the cluster time for all sessions defined as test entities. This does fix the one test that is failing, but wouldn't handle any future tests that don't use session entities
  • For each client entity, create a ClientSession, advance the cluster time, and execute a ping using it. This also works, but breaks other tests because the ping command now appears as a command event (and indirectly in CMAP events)

One other idea that hasn't been tried yet:

  • Create a dedicated MongoClient for each mongos server in the multi-mongos connection string. Pick one of them to use for creating the collection and adding the initial documents. Then grab the cluster time from that MongoClient, and for each of the others create a session, advance the cluster time, and execute a ping. That will ensure the cluster time is gossiped to all mongos servers. Then proceed with normal entity creation and operation execution.


 Comments   
Comment by Jeffrey Yemin [ 30/Jan/24 ]

it's unclear to me if mongos-unpin: unpin after TransientTransactionError error on commit is the only test affected by SERVER-82353.

AFAIK it is the only test currently affected, but it's not clear to me why some of the other tests in the same file are not also affected. It's possible they are passing due to cluster time gossiping side effects from previous tests.

If the test failure here pertains to transactions, what is the relevance to "future tests that don't use session entities"? Wouldn't any transaction test utilize at least one session entity?

That's a good observation and one I think I overlooked when looking for flaws in the original approach that I tried (which did work). So maybe this really is a simple fix.

If the ping was considered part of a setup routine for the client entity, I see no reason for it to impact expectEvents, which denotes events "expected to be observed while executing the test's operations." That said, the spec is vague on this and never had any reason to go into significant detail, so we'd need to clarify this so all drivers know to be consistent about only observing events for execution of tests[].operations.

This might be a short cut taken by the Java driver that currently always works, since each MongoClient should be unused by anything else until test operations start. It didn't seem trivial to work around it at first glance, but it's feasible.

If the goal here is to gossip a cluster time across the entire mongos cluster, then I expect this is something we'd only want to do once for the entire test suite (or, more tactically, before the first test that might be affected by SERVER-82353, but no more than once).

It has to be done to create a causal link between the collection creation and the insert into that collection, so it requires getting the cluster time from after the creation occurs. To make this one test pass, it would only have to be done once, but it has to be at the right time.

Comment by Jeremy Mikola [ 30/Jan/24 ]

I read through HELP-54982, but it's unclear to me if mongos-unpin: unpin after TransientTransactionError error on commit is the only test affected by SERVER-82353.

If so, I think we can explore splitting out that test to its own file before attempting to alter the test runner's core logic. For instance, we can defer collection creation to the test itself (out of initialData) to ensure a common MongoClient is used for the operation.

I do have some questions about text in the description above.

Advance the cluster time for all sessions defined as test entities. This does fix the one test that is failing, but wouldn't handle any future tests that don't use session entities

If the test failure here pertains to transactions, what is the relevance to "future tests that don't use session entities"? Wouldn't any transaction test utilize at least one session entity? Also, if we presently have only one test failing due to SERVER-82353, I don't understand the need for a broader focus at present.

For each client entity, create a ClientSession, advance the cluster time, and execute a ping using it. This also works, but breaks other tests because the ping command now appears as a command event (and indirectly in CMAP events)

If the ping was considered part of a setup routine for the client entity, I see no reason for it to impact expectEvents, which denotes events "expected to be observed while executing the test's operations." That said, the spec is vague on this and never had any reason to go into significant detail, so we'd need to clarify this so all drivers know to be consistent about only observing events for execution of tests[].operations.

Create a dedicated MongoClient for each mongos server in the multi-mongos connection string. Pick one of them to use for creating the collection and adding the initial documents. Then grab the cluster time from that MongoClient, and for each of the others create a session, advance the cluster time, and execute a ping. That will ensure the cluster time is gossiped to all mongos servers. Then proceed with normal entity creation and operation execution.

This seems like the most non-invasive approach, as it's just isolated overhead for the test runner – similar to what we do by running distinct commands before any test that might execute distinct within a sharded transaction. And correct me if I'm wrong, but if the goal here is to gossip a cluster time across the entire mongos cluster, then I expect this is something we'd only want to do once for the entire test suite (or, more tactically, before the first test that might be affected by SERVER-82353, but no more than once).

Generated at Thu Feb 08 08:26:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.