Core Server / SERVER-59916

T{1, 2}Starts{First, Second}AndWins In WriteConflictHelpers Does Not Synchronize Committing Of Failed Transaction


      Summary

      The WriteConflictHelpers used by the transaction_write_conflicts.js test do not ensure that every transaction they create is aborted or committed. As a result, it is possible for the transaction-related locks to never be released, which in the case of the associated BF caused the validate_collections.js hook to hang forever.

      This issue only occurs with multi-shard sharded collections, because in those cases a transaction can start on one shard but not on another, causing a difference between the shards in which transactions are "present" and able to be aborted/committed.
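
      As a rough illustration, the precondition is a collection whose chunks live on more than one shard, so that a single multi-document write in a transaction has to touch both shards. A minimal mongo shell sketch of such a layout (the shard-rs0 name comes from the logs below; the test.coll namespace, shard key, split point, and second shard name are assumptions) could look like:

      // Sketch of a two-shard layout in which the issue can occur: the collection's
      // chunks live on different shards, so one transactional multi-write touches both.
      sh.enableSharding("test");                          // assumed database name
      sh.shardCollection("test.coll", {_id: 1});          // assumed namespace and shard key
      sh.splitAt("test.coll", {_id: 0});                  // assumed split point
      sh.moveChunk("test.coll", {_id: -1}, "shard-rs0");  // shard name from the logs below
      sh.moveChunk("test.coll", {_id: 1}, "shard-rs1");   // assumed second shard name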

      Context

      The validate_collections.js hook runs after the completion of a test, in this case transaction_write_conflicts.js.

      The latter test uses the T1StartsFirstAndWins and T2StartsSecondAndWins helpers to define test cases that, given two separate operations, assert that the correct one is prevented from taking effect and returns a WriteConflict. These helpers are defined in write_conflicts.js.
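
      The following is an illustrative sketch of the pattern these helpers exercise, not their actual implementation: two sessions write to the same document inside transactions, T1's write lands first, and T2's conflicting write fails with a WriteConflict. The test.coll namespace and document are assumptions.

      const s1 = db.getMongo().startSession();
      const s2 = db.getMongo().startSession();
      s1.startTransaction();
      s2.startTransaction();

      // T1's operation takes effect first.
      assert.commandWorked(s1.getDatabase("test").runCommand(
          {update: "coll", updates: [{q: {_id: 1}, u: {$set: {winner: "t1"}}}]}));

      // T2's conflicting operation is rejected with a WriteConflict.
      assert.commandFailedWithCode(
          s2.getDatabase("test").runCommand(
              {update: "coll", updates: [{q: {_id: 1}, u: {$set: {winner: "t2"}}}]}),
          ErrorCodes.WriteConflict);

      s1.commitTransaction();
      s2.abortTransaction_forTesting();  // The failed transaction still has to be cleaned up.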

      The Sequence

      1. The test runs the T1StartsFirstAndWins test case under the "multidelete-multiupdate" section. The writeConflictTest helper creates the session used to start the failing transaction.
      2. Using the created session/txn information, the provided operation is executed. It fails with a WriteConflict error on Shard 1, causing the transaction to be aborted on Shard 1 (L-2368); the MongoS then implicitly aborts the transaction on all participant shards. This is the first abortTransaction command Shard 0 receives.
      3. The test helper commits the transaction, which causes the MongoS to send the commitTransaction command to the pertinent shards. The command is expected to fail with NoSuchTransaction because of the abort caused by the WriteConflict, but it also fails because Shard 0 hasn't started the transaction yet.
      4. The failure of the commitTransaction results in the MongoS implicitly aborting the transaction on all pertinent shards. This is the second abortTransaction command Shard 0 receives.
      5. Shard 0 starts the transaction (L-2402). Since it was started after all of the aborts/commits, it will last until the transactionLifetimeLimitSeconds limit is reached, which in the test environments is 24 hours (see the diagnostic sketch after this list).
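
      A diagnostic sketch of how the resulting state can be observed, assuming a shell connected directly to the Shard 0 primary: the orphaned transaction shows up as an open transaction on an idle session that still holds its locks, and transactionLifetimeLimitSeconds bounds how long it can persist.

      const admin = db.getSiblingDB("admin");

      // List open transactions, including ones on idle sessions; the orphaned
      // txnNumber 0 appears here until the lifetime limit expires.
      admin.aggregate([
          {$currentOp: {allUsers: true, idleSessions: true}},
          {$match: {"transaction.parameters.txnNumber": {$exists: true}}}
      ]).forEach(op => printjson(op.transaction));

      // The server parameter that bounds how long the abandoned transaction can live.
      printjson(admin.runCommand({getParameter: 1, transactionLifetimeLimitSeconds: 1}));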

      Proposed Solution
      At the end of T1StartsFirstAndWins and T2StartsSecondAndWins, add the following logic (sketched after the list):

      1. Start a new transaction with the session associated with the failing transaction.
      2. Send a no-op command to all of the shards (or alternatively all shards that didn't have the WriteConflict error).
      3. Commit the transaction started in step 1.
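
      A minimal sketch of that logic, assuming the helper still holds the failing transaction's session (called failSession below) and the sharded collection under test (coll); both names are placeholders, not identifiers from the existing helper:

      // Start a fresh transaction on the same session; its higher txnNumber supersedes
      // whatever state the failed transaction left behind on each shard.
      failSession.startTransaction();

      // A find with no shard-key filter fans out to every shard, so each shard joins
      // this new transaction.
      const sessionDB = failSession.getDatabase(coll.getDB().getName());
      assert.commandWorked(sessionDB.runCommand({find: coll.getName(), filter: {}}));

      // Committing through the MongoS now reaches all participants and releases the locks.
      assert.commandWorked(failSession.commitTransaction_forTesting());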

      Relevant Logs

      [ShardedClusterFixture:job2:mongos] 2021-08-25T22:33:46.489+0000 D3 TXN      [conn97] 93007dc5-7a2c-456a-8241-92fef7de25c5:0 Implicitly aborting transaction on 2 shard(s) due to error: WriteConflict: Encountered error from localhost:20503 during a transaction :: caused by :: WriteConflict
      [ShardedClusterFixture:job2:shard1:primary] 2021-08-25T22:33:46.483+0000 D4 TXN      [conn51] New transaction started with txnNumber: 0 on session with lsid 93007dc5-7a2c-456a-8241-92fef7de25c5
      [ShardedClusterFixture:job2:shard1:primary] 2021-08-25T22:33:46.487+0000 I  TXN      [conn51] transaction parameters:{ lsid: { id: UUID("93007dc5-7a2c-456a-8241-92fef7de25c5"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 0, autocommit: false, readConcern: { level: "snapshot", afterClusterTime: Timestamp(1629930826, 89) } }, readTimestamp:Timestamp(0, 0), terminationCause:aborted timeActiveMicros:4799 timeInactiveMicros:21 numYields:0 locks:{ ReplicationStateTransition: { acquireCount: { w: 6 } }, Global: { acquireCount: { r: 3, w: 2 } }, Database: { acquireCount: { r: 2, w: 2 } }, Collection: { acquireCount: { w: 2 } }, Mutex: { acquireCount: { r: 6 } }, oplog: { acquireCount: { r: 2 } } } storage:{} wasPrepared:0, 4ms
      [ShardedClusterFixture:job2:mongos] 2021-08-25T22:33:46.481+0000 D3 TXN      [conn97] 93007dc5-7a2c-456a-8241-92fef7de25c5:0 New transaction started
      [ShardedClusterFixture:job2:mongos] 2021-08-25T22:33:46.496+0000 I  TXN      [conn97] transaction parameters:{ lsid: { id: UUID("93007dc5-7a2c-456a-8241-92fef7de25c5"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 0, autocommit: false, readConcern: { afterClusterTime: Timestamp(1629930826, 89) } }, numParticipants:2, terminationCause:aborted, abortCause:WriteConflict, timeActiveMicros:15122, timeInactiveMicros:0, 15ms
      [ShardedClusterFixture:job2:mongos] 2021-08-25T22:33:46.512+0000 D3 TXN      [conn97] 93007dc5-7a2c-456a-8241-92fef7de25c5:0 Implicitly aborting transaction on 2 shard(s) due to error: NoSuchTransaction: 93007dc5-7a2c-456a-8241-92fef7de25c5:0 Failed to commit transaction because a previous statement on the transaction participant shard-rs0 was unsuccessful.
      [ShardedClusterFixture:job2:shard0:primary] 2021-08-25T22:33:46.517+0000 D4 TXN      [conn94] New transaction started with txnNumber: 0 on session with lsid 93007dc5-7a2c-456a-8241-92fef7de25c5
      [ShardedClusterFixture:job2:shard0:primary] 2021-08-25T22:33:46.517+0000 D3 TXN      [conn94] Inserting coordinator 93007dc5-7a2c-456a-8241-92fef7de25c5:0 into in-memory catalog
      
      

            Assignee:
            luis.osta@mongodb.com Luis Osta (Inactive)
            Reporter:
            luis.osta@mongodb.com Luis Osta (Inactive)