-
Type: Improvement
-
Resolution: Unresolved
-
Priority: Unknown
-
Component/s: Unified Test Runner
-
Needed
-
The driver specification repository has a unified (i.e. implemented by all drivers) specification test (specified in YAML) which asserts that a ClientSession (an abstraction in drivers that wraps a server session) unpins from a mongos server after an aborted transaction. The test is "unpin after TransientTransactionError error on commit" in transactions/tests/unified/mongos-unpin.yml. The test started failing on January 24 on multiple drivers with an error like this:
Command failed with error 272 (MigrationConflict): 'Transaction 68a39c33-df8c-473d-9ffc-f760c59d170d - 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU= - - :1 was aborted on statement 0 due to: a non-retryable snapshot error :: caused by :: Encountered error from localhost:27219 during a transaction :: caused by :: Database mongos-unpin-db has undergone a catalog change operation at time Timestamp(1706131985, 34) and no longer satisfies the requirements for the current transaction which requires Timestamp(1706131981, 1). Transaction will be aborted.' on server localhost:27018. The full response is {"ok": 0.0, "errmsg": "Transaction 68a39c33-df8c-473d-9ffc-f760c59d170d - 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU= - - :1 was aborted on statement 0 due to: a non-retryable snapshot error :: caused by :: Encountered error from localhost:27219 during a transaction :: caused by :: Database mongos-unpin-db has undergone a catalog change operation at time Timestamp(1706131985, 34) and no longer satisfies the requirements for the current transaction which requires Timestamp(1706131981, 1). Transaction will be aborted.", "code": 272, "codeName": "MigrationConflict", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1706131985, "i": 94}}, "signature": {"hash": {"$binary": {"base64": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=", "subType": "00"}}, "keyId": 0}}, "operationTime": {"$timestamp": {"t": 1706131985, "i": 94}}, "errorLabels": ["TransientTransactionError"]
In HELP-54982, the failure was determined to be caused by changes introduced in SERVER-82353, an issue that fixes a data-loss bug and should not be reverted.
The decision in the HELP ticket is to work around it in the unified test runner. Several solutions were attempted, and none were found to be satisfactory.
The underlying issue that needs to be worked around is that the test setup creates a new collection via one mongos server, and the test body inserts documents into the new collection on a (potentially) different mongos. Moreover, the setup uses a dedicated MongoClient and the test body uses a test-defined MongoClient, so no causal relationship is established between the two via cluster time gossiping. The workaround therefore has to involve gossiping the cluster time between MongoClient instances, which can be done via ClientSession#advanceClusterTime. The question is how to do it without breaking the tests in other ways. Things that have been tried:
- Advance the cluster time for all sessions defined as test entities. This does fix the one test that is failing, but wouldn't handle any future tests that don't use session entities.
- For each client entity, create a ClientSession, advance the cluster time, and execute a ping using it. This also works, but breaks other tests because the ping command now appears as a command event (and indirectly in CMAP events).
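The first approach above could be sketched roughly as follows in Python. This is only an illustration, not any driver's actual test runner code: `gossip_cluster_time` is a hypothetical helper, and the sessions are duck-typed on a `cluster_time` property and an `advance_cluster_time` method, which is how PyMongo's `ClientSession` exposes the ClientSession#advanceClusterTime functionality.

```python
from typing import Any, Iterable, Mapping, Optional


def gossip_cluster_time(
    source_session: Any, target_sessions: Iterable[Any]
) -> Optional[Mapping[str, Any]]:
    """Copy the cluster time seen by source_session to each target session.

    source_session is assumed to be the session the runner's internal
    MongoClient used for test setup; target_sessions are the session
    entities defined by the test. Returns the cluster time that was
    gossiped, or None if the source has not observed one yet.
    """
    cluster_time = source_session.cluster_time
    if cluster_time is None:
        # No command has completed on the source session, nothing to gossip.
        return None
    for session in target_sessions:
        # Only updates local session state; the value is sent to the server
        # with the next command executed on that session.
        session.advance_cluster_time(cluster_time)
    return cluster_time
```

Because this only touches sessions that the test itself defines, operations that run without an explicit session entity would not pick up the advanced cluster time, which is the limitation noted in the first bullet.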
One other idea that hasn't been tried yet:
- Create a dedicated MongoClient for each mongos server in the multi-mongos connection string. Pick one of them to use for creating the collection and adding the initial documents. Then grab the cluster time from that MongoClient, and for each of the others create a session, advance the cluster time, and execute a ping. That will ensure the cluster time is gossiped to all mongos servers. Then proceed with normal entity creation and operation execution.
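A rough Python sketch of this untried idea is shown below, under several assumptions: `per_mongos_uris` and `gossip_to_all_mongoses` are hypothetical helpers, the connection string uses the plain `mongodb://` scheme, and the client and session objects are duck-typed on the methods PyMongo provides (`start_session`, `advance_cluster_time`, `admin.command`). It has not been validated against the failing test.

```python
from typing import Any, Iterable, List


def per_mongos_uris(uri: str) -> List[str]:
    """Split a multi-host mongodb:// URI into one URI per host.

    Assumes the plain "mongodb://" scheme (not "mongodb+srv://") and that
    any "/", "?" or "@" in the credentials is percent-encoded.
    """
    scheme = "mongodb://"
    if not uri.startswith(scheme):
        raise ValueError("expected a mongodb:// URI")
    rest = uri[len(scheme):]
    # Everything from the first "/" or "?" onward (database, options)
    # is shared by all hosts.
    cut = min(
        (i for i in (rest.find("/"), rest.find("?")) if i != -1),
        default=len(rest),
    )
    authority, tail = rest[:cut], rest[cut:]
    userinfo, at, hosts = authority.rpartition("@")
    return [f"{scheme}{userinfo}{at}{host}{tail}" for host in hosts.split(",")]


def gossip_to_all_mongoses(setup_session: Any, other_clients: Iterable[Any]) -> int:
    """Push the setup session's cluster time to every other mongos.

    setup_session is the session used to create the collection and insert
    the initial documents; other_clients are dedicated internal clients,
    one per remaining mongos. Returns the number of clients pinged.
    """
    cluster_time = setup_session.cluster_time
    if cluster_time is None:
        return 0
    pinged = 0
    for client in other_clients:
        with client.start_session() as session:
            session.advance_cluster_time(cluster_time)
            # The ping carries the advanced cluster time to this mongos.
            client.admin.command("ping", session=session)
        pinged += 1
    return pinged
```

With PyMongo, the dedicated clients could be built as `[MongoClient(u) for u in per_mongos_uris(uri)]`. Since these clients are internal to the runner rather than test entities, the extra ping would presumably not show up in the command or CMAP events that tests assert on, though that would need to be confirmed per driver.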
is related to:
- SERVER-82353 Multi-document transactions can miss documents when movePrimary runs concurrently (Closed)
- RUBY-3400 Fix red specs on latest server (Closed)
related to:
- DRIVERS-2860 Introduce new API for starting a causally consistent session from the timestamps of another operation or client session (Backlog)
- GODRIVER-3113 [Build Failure] unpin_after_TransientTransactionError_error_on_commit (Blocked)
split to:
- PHPLIB-1400 Gossip cluster time from internal MongoClient to session entities (Closed)
- CDRIVER-5304 Gossip cluster time from internal MongoClient to session entities (Backlog)
- CXX-2841 Gossip cluster time from internal MongoClient to session entities (Backlog)
- GODRIVER-3137 Gossip cluster time from internal MongoClient to session entities (Backlog)
- JAVA-5334 Gossip cluster time from internal MongoClient to session entities (Closed)
- RUBY-3405 Gossip cluster time from internal MongoClient to session entities (Closed)
- CSHARP-4979 Gossip cluster time from internal MongoClient to session entities (Closed)
- MOTOR-1261 Gossip cluster time from internal MongoClient to session entities (Closed)
- NODE-5962 Gossip cluster time from internal MongoClient to session entities (Closed)
- PYTHON-4227 Gossip cluster time from internal MongoClient to session entities (Closed)
- RUST-1855 Gossip cluster time from internal MongoClient to session entities (Closed)