Core Server / SERVER-60624

txn_commit_optimizations_for_read_only_shards.js pauses replication on coordinator and can leave transaction stuck in prepare

    • Fully Compatible
    • ALL
    • v5.2, v5.0
    • Sharding 2021-12-13, Sharding 2021-12-27
    • 23
    • 2

      The test txn_commit_optimizations_for_read_only_shards.js runs transactions with the coordinateCommitReturnImmediatelyAfterPersistingDecision server parameter enabled.

      This means that the commitTransaction command will return early as soon as the _decisionPromise gets emplaced (either successfully or due to an error).

      As a result, the next test case can start before the TransactionCoordinator has finished with the existing transaction, which is part of the coverage this test is meant to provide.

      The problem
      For certain test cases, the beforeStatements function stops server replication, meaning that the secondary stops applying oplog entries.

      This results in the following being possible:

      1. txnNumber 51 starts. If the secondary has fallen behind on the oplog, the opTime for decisionPersisted hasn't been reached yet, but execution continues and the _decisionPromise is emplaced.
      2. The next test case starts for txnNumber 52 and replication is completely stopped. This leaves the existing transaction (txnNumber 51) stuck waiting for majority write concern.
      3. The new test case gets stuck waiting for txnNumber 51 to exit the prepared state.
      4. Since the new test case can never finish (because txnNumber 52 is waiting for the previous one to exit the prepared state), replication is never restarted and txnNumber 51 can never finish.
      5. This causes the test to hang forever.
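
      The circular wait in the steps above can be made explicit with a small wait-for graph. This is only an illustrative sketch in plain JavaScript: the node labels and the findCycle helper are made up for this ticket and do not correspond to any server or jstest code.

```javascript
// Hypothetical wait-for graph: each key is blocked until its value happens.
// Labels paraphrase the steps in this ticket; they are not real identifiers.
const waitsFor = {
  "test case 52 finishes": "txn 51 exits prepare",
  "txn 51 exits prepare": "majority ack of txn 51 decision",
  "majority ack of txn 51 decision": "replication restarted",
  "replication restarted": "test case 52 finishes",
};

// Follow the single outgoing edge from `start` until a node repeats
// (a cycle) or the chain ends (no cycle reachable from `start`).
function findCycle(graph, start) {
  const path = [];
  const seen = new Set();
  let node = start;
  while (node !== undefined && !seen.has(node)) {
    seen.add(node);
    path.push(node);
    node = graph[node];
  }
  if (node === undefined) {
    return null;
  }
  return path.slice(path.indexOf(node)).concat(node);
}

const cycle = findCycle(waitsFor, "test case 52 finishes");
console.log(cycle ? cycle.join(" -> ") : "no cycle");

// If txn 51 finishes before replication is stopped, the majority ack no
// longer depends on replication being restarted, and the cycle disappears.
const waitsForAfterFix = Object.assign({}, waitsFor);
delete waitsForAfterFix["majority ack of txn 51 decision"];
const fixedCycle = findCycle(waitsForAfterFix, "test case 52 finishes");
console.log(fixedCycle ? fixedCycle.join(" -> ") : "no cycle");
```

      In this model the hang is a four-node cycle, and removing any single edge breaks it.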

      Since the issue arises from the test stopping replication with the coordinateCommitReturnImmediatelyAfterPersistingDecision flag enabled, this is a test-only problem.

      Proposed Solution
      If we waited for the existing transaction to finish before moving on to the next test case, either in the cleanUp option available in the failureMode or in the for loop itself, this problem wouldn't occur, because the new test case wouldn't be able to stop replication before the transaction had finished.
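
      A toy model of the ordering change, assuming the wait is placed at the top of the loop over test cases. Everything here (ToyCluster, runCases, the step method) is hypothetical and only mimics the behavior described in this ticket; the real test would need to wait on the actual coordinator state instead. The model also assumes the worst case, where no background progress happens between test cases unless the test explicitly waits.

```javascript
// Toy stand-in for the cluster; not real server or jstest code.
class ToyCluster {
  constructor() {
    this.replicating = true;
    this.preparedTxns = new Set();
  }
  stopReplication() { this.replicating = false; }
  restartReplication() { this.replicating = true; }
  // Models commitTransaction returning early: the transaction is still
  // in prepare when control goes back to the test.
  commitReturnEarly(txnNumber) { this.preparedTxns.add(txnNumber); }
  // One unit of background progress. Prepared transactions can only
  // finish while replication is running (majority write concern).
  step() {
    if (this.replicating) {
      this.preparedTxns.clear();
    }
  }
}

function runCases(cases, waitForPreviousTxn) {
  const cluster = new ToyCluster();
  for (const testCase of cases) {
    if (waitForPreviousTxn) {
      // Sketch of the proposal: let the previous transaction finish
      // while replication is still running.
      cluster.step();
    }
    if (testCase.stopsReplication) {
      cluster.stopReplication();
    }
    // The new transaction is blocked while an older one is still prepared.
    // Bounded retries stand in for "waits forever".
    for (let i = 0; i < 10 && cluster.preparedTxns.size > 0; i++) {
      cluster.step();
    }
    if (cluster.preparedTxns.size > 0) {
      return `hang before txnNumber ${testCase.txnNumber}`;
    }
    cluster.commitReturnEarly(testCase.txnNumber);
    cluster.restartReplication();
  }
  return "all cases finished";
}

const cases = [
  { txnNumber: 51, stopsReplication: false },
  { txnNumber: 52, stopsReplication: true },
];
const withoutWait = runCases(cases, false);
const withWait = runCases(cases, true);
console.log(withoutWait);
console.log(withWait);
```

      Without the wait the model reports a hang before txnNumber 52. With the wait both cases finish, because replication is only stopped once txnNumber 51 has left the prepared state.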

            matt.boros@mongodb.com Matt Boros
            luis.osta@mongodb.com Luis Osta (Inactive)