Drivers / DRIVERS-2816

Gossip cluster time from internal MongoClient to session entities

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Unknown
    • None
    • Component/s: Unified Test Runner
    • None
    • Needed

      Summary of necessary driver changes

      • Modify the unified test runner to collect the cluster time from the internal MongoClient following initialData operations, and use it to advance cluster times of all session entities created during the subsequent test. See mongodb/specifications@fd6ae12 for implementation details.
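A minimal sketch of that runner change, written in Python. It assumes client and session objects shaped like PyMongo's (`start_session()`, `ClientSession.cluster_time`, `ClientSession.advance_cluster_time()`); the function names and the `setup` callable are hypothetical and are not taken from the specification commit.

```python
from typing import Any, Callable, Iterable, Mapping, Optional

ClusterTime = Mapping[str, Any]


def run_initial_data(
    internal_client: Any, setup: Callable[[Any], None]
) -> Optional[ClusterTime]:
    """Run initialData setup in an explicit session and return its cluster time.

    `setup` is a hypothetical callable that performs the drop/create/insert
    operations and passes the given session to each of them, so the session
    records the $clusterTime gossiped back by the server.
    """
    with internal_client.start_session() as session:
        setup(session)
        return session.cluster_time


def advance_session_entities(
    sessions: Iterable[Any], cluster_time: Optional[ClusterTime]
) -> None:
    """Advance every session entity to the cluster time seen during setup."""
    if cluster_time is None:
        # Nothing was gossiped (for example, a standalone server).
        return
    for session in sessions:
        session.advance_cluster_time(cluster_time)
```

A runner would call `run_initial_data` once per test, then pass the result to `advance_session_entities` right after the test's session entities are created and before any test operation runs.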

      Context for other referenced/linked tickets

      • SERVER-82353 addressed a correctness issue for sharded clusters, which requires gossiping the cluster time between mongos hosts (particularly the server used to create collections and the other(s) used to execute transactions) in order to avoid a MigrationConflict error.
      Key            Status/Resolution  FixVersion
      CDRIVER-5304   Backlog
      CXX-2841       Backlog
      CSHARP-4979    Fixed              2.25.0
      GODRIVER-3137  Backlog
      JAVA-5334      Fixed              4.11.2, 5.0.1, 5.1.0
      NODE-5962      Fixed              6.5.0
      MOTOR-1261     Duplicate
      PYTHON-4227    Fixed              4.7
      PHPLIB-1400    Fixed              1.17.1
      RUBY-3405      Fixed              2.20.0
      RUST-1855      Fixed              3.0.0

      The driver specification repository has a unified (i.e. implemented by all drivers) specification test (specified in YAML) which asserts that a ClientSession (an abstraction in drivers that wraps a server session) unpins from a mongos server after an aborted transaction. The test is "unpin after TransientTransactionError error on commit" in transactions/tests/unified/mongos-unpin.yml. The test started failing on January 24 on multiple drivers with an error like this:

      Command failed with error 272 (MigrationConflict): 'Transaction 68a39c33-df8c-473d-9ffc-f760c59d170d - 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU= -  - :1 was aborted on statement 0 due to: a non-retryable snapshot error :: caused by :: Encountered error from localhost:27219 during a transaction :: caused by :: Database mongos-unpin-db has undergone a catalog change operation at time Timestamp(1706131985, 34) and no longer satisfies the requirements for the current transaction which requires Timestamp(1706131981, 1). Transaction will be aborted.' on server localhost:27018. The full response is {"ok": 0.0, "errmsg": "Transaction 68a39c33-df8c-473d-9ffc-f760c59d170d - 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU= -  - :1 was aborted on statement 0 due to: a non-retryable snapshot error :: caused by :: Encountered error from localhost:27219 during a transaction :: caused by :: Database mongos-unpin-db has undergone a catalog change operation at time Timestamp(1706131985, 34) and no longer satisfies the requirements for the current transaction which requires Timestamp(1706131981, 1). Transaction will be aborted.", "code": 272, "codeName": "MigrationConflict", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1706131985, "i": 94}}, "signature": {"hash": {"$binary": {"base64": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=", "subType": "00"}}, "keyId": 0}}, "operationTime": {"$timestamp": {"t": 1706131985, "i": 94}}, "errorLabels": ["TransientTransactionError"]
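The two timestamps in the error message can be compared directly. The sketch below extracts them from an excerpt of the message and shows that the timestamp the transaction requires is older than the catalog change, which is consistent with the mongos running the transaction not having learned a recent enough cluster time. The helper name `parse_timestamps` is hypothetical.

```python
import re
from typing import List, Tuple


def parse_timestamps(message: str) -> List[Tuple[int, int]]:
    """Return every Timestamp(t, i) in the message as a (t, i) tuple."""
    pairs = re.findall(r"Timestamp\((\d+),\s*(\d+)\)", message)
    return [(int(t), int(i)) for t, i in pairs]


# Excerpt of the errmsg quoted above.
errmsg = (
    "Database mongos-unpin-db has undergone a catalog change operation at time "
    "Timestamp(1706131985, 34) and no longer satisfies the requirements for the "
    "current transaction which requires Timestamp(1706131981, 1)."
)

catalog_change, txn_requires = parse_timestamps(errmsg)

# Tuples compare the seconds first and then the increment,
# which matches the ordering of BSON timestamps.
print(txn_requires < catalog_change)        # True
print(catalog_change[0] - txn_requires[0])  # 4 (seconds)
```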
      

      In HELP-54982, the failure was determined to be caused by changes introduced in SERVER-82353, which fixes a data-loss bug and should not be reverted.

      The decision in the HELP ticket is to work around it in the unified test runner. Several solutions were attempted, and none were found to be satisfactory.

      The underlying issue is that test setup creates a new collection via one mongos server, while the test body inserts documents into that collection on a (potentially) different mongos. Moreover, setup uses a dedicated MongoClient and the test body uses a test-defined MongoClient, so cluster time gossiping establishes no causal relationship between the two. The workaround therefore has to gossip the cluster time between MongoClient instances, which can be done via ClientSession#advanceClusterTime. The question is how to do it without breaking the tests in other ways. Things that have been tried:

      • Advance the cluster time for all sessions defined as test entities. This does fix the one test that is failing, but wouldn't handle any future tests that don't use session entities.
      • For each client entity, create a ClientSession, advance the cluster time, and execute a ping using it. This also works, but breaks other tests because the ping command now appears as a command event (and indirectly in CMAP events).

      One other idea that hasn't been tried yet:

      • Create a dedicated MongoClient for each mongos server in the multi-mongos connection string. Pick one of them to use for creating the collection and adding the initial documents. Then grab the cluster time from that MongoClient, and for each of the others create a session, advance the cluster time, and execute a ping. That will ensure the cluster time is gossiped to all mongos servers. Then proceed with normal entity creation and operation execution.
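The idea above could be sketched as follows in Python. This is an untested outline, not an implementation from any driver: `gossip_to_all_mongos`, `make_client`, and `setup` are hypothetical names, and the client objects are assumed to expose `start_session()`, `admin.command()`, and `close()` in the way PyMongo's `MongoClient` does.

```python
from typing import Any, Callable, Mapping, Optional, Sequence

ClusterTime = Mapping[str, Any]


def gossip_to_all_mongos(
    hosts: Sequence[str],
    make_client: Callable[[str], Any],
    setup: Callable[[Any, Any], None],
) -> Optional[ClusterTime]:
    """Create the collection via one mongos, then gossip its cluster time.

    `make_client(host)` is assumed to return a client connected to exactly
    one mongos. `setup(client, session)` is a hypothetical callable that
    creates the collection and inserts the initial documents.
    """
    if not hosts:
        raise ValueError("at least one mongos host is required")

    clients = []
    try:
        for host in hosts:
            clients.append(make_client(host))
        first, others = clients[0], clients[1:]

        # Run setup through the first mongos and record its cluster time.
        with first.start_session() as session:
            setup(first, session)
            cluster_time = session.cluster_time

        # Gossip that cluster time to every other mongos with a ping.
        if cluster_time is not None:
            for client in others:
                with client.start_session() as session:
                    session.advance_cluster_time(cluster_time)
                    client.admin.command("ping", session=session)
        return cluster_time
    finally:
        for client in clients:
            client.close()
```

Because these clients are separate from the test-defined client entities, the extra ping commands would not show up in the command or CMAP events that the tests assert on.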

            Assignee: Jeffrey Yemin (jeff.yemin@mongodb.com)
            Reporter: Jeffrey Yemin (jeff.yemin@mongodb.com)
            Votes: 0
            Watchers: 5

              Created:
              Updated: