The WriteConflictHelpers used by the transaction_write_conflicts.js test do not assure that all transaction it has created are aborted/comitted. This means that its possible for the transaction-related locks to never be released, which in the case of the associated BF called the validate_collections.js hook to hang forever.
This issue only occurs in multi-shard sharded collections as in those cases its possible that a transaction can start in one shard but not another. Causing there to be a difference in which transactions are "present" and able to be aborted/committed.
The validate_collections.js test is a hook that runs after the completion of a running test. In this case transaction_write_conflicts.js.
The latter test utilizes the T1StartsFirstAndWins and T2StartsFirstAndWins to define tests that given two separate operations, assure that the correct one is prevented from taking effect and returns a WriteConflict. These helpers are defined in write_conflicts.js
- The test runs the T1StartsFirstAndWins test case under the "multidelete-multiupdate" section . The writeConflictTest helper will create the session used to create the failing transaction
- Using the created session/txn information, the provided operation gets executed. Which fails with a WriteConflict error on Shard 1 and causes the transaction to be aborted on shard 1 L-2368. This is the first abortTransaction command Shard 0 receives.
- The test helper commits the transaction, which causes the MongoS to send the commit transaction command to the pertinent shards, which fails with NoSuchTransaction as Shard 0 hasn't started the transaction yet. The command expected to fail with NoSuchTransaction because of the abort caused by the WriteConflict. But it also fails because Shard 0 hasn't started the transaction yet.
- The failure of the commitTransaction results in MongoS implicitly aborting the transaction to all pertinent shards. This is the second abortTransaction command Shard 0 receives.
- Shard 0 starts the transaction L-2402. But since it was started after all of the aborts/commits, it will last until the transactionLifetimeLimitSeconds limit. Which in the test environments is 24 hours.
At the end of T1StartsFirstAndWins and T2StartsSecondAndWins. Add the following logic:
- Start a new transaction with the session associated with the failing transaction
- Send a no-op command to all of the shards (or alternatively all shards that didn't have the WriteConflict error)
- Commit the transaction started in step 1.