[SERVER-40186] The logic in `auto_retry_transaction.js:withTxnAndAutoRetry` does not retry failed commits Created: 18/Mar/19  Updated: 29/Oct/23  Resolved: 25/Apr/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.1.11

Type: Bug Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: Jack Mulrow
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Gantt Dependency
has to be done before SERVER-40183 Create kill_sessions version of multi... Closed
Related
related to SERVER-38297 Killing session on a secondary curren... Closed
related to SERVER-39890 Make network_error_and_txn_override.j... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2019-03-25, Sharding 2019-04-08, Sharding 2019-04-22, Sharding 2019-05-06
Participants:
Linked BF Score: 5

 Description   

The multi_statement_transaction_kill_sessions_atomicity_isolation.js concurrency workload executes ordered updates in transactions using snapshot isolation and from time to time kills random sessions, finally validating that the transactions still committed in the correct order.

Enabling this workload against a sharded cluster leads to failures which appear as if transactions committed out of order:

         Error: [[ ]] != [[
          {
                  "tid" : 9,
                  "iteration" : 14,
                  "numUpdated" : 2
          },
          {
                  "tid" : 8,
                  "iteration" : 6,
                  "numUpdated" : 3
          },
          {
                  "tid" : 3,
                  "iteration" : 4,
                  "numUpdated" : 5
          },
          {
                  "tid" : 9,
                  "iteration" : 14,
                  "numUpdated" : 2
          }

These failures are not caused by a server bug. Interrupting a session that is running two-phase commit on mongos may still result in the transaction committing. Because the test then retries the entire transaction (with exactly the same parameters), the transaction ends up committing twice.
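
To illustrate the double commit, here is a toy in-memory model, not the actual workload or server code. The names makeFakeCluster and naiveRunWithRetry are hypothetical, and the "cluster" is assumed to apply the commit before reporting the interruption to the client:

```javascript
// Toy model only: every name here is hypothetical, nothing talks to a real server.
function makeFakeCluster() {
    const log = [];            // committed updates, in commit order
    let interruptNext = true;  // the first commit is "interrupted" after taking effect
    return {
        log,
        commit(pending) {
            log.push(...pending);  // the commit decision has already been applied...
            if (interruptNext) {
                interruptNext = false;
                const e = new Error("operation was interrupted");
                e.code = 11601;  // Interrupted
                throw e;         // ...but the client only sees an error
            }
        },
    };
}

// Naive retry: on any error, re-run the whole transaction with the same parameters.
function naiveRunWithRetry(cluster, update, maxAttempts = 5) {
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            cluster.commit([update]);
            return;
        } catch (e) {
            // Swallow the error and run the entire transaction again.
        }
    }
    throw new Error("gave up after " + maxAttempts + " attempts");
}

const cluster = makeFakeCluster();
naiveRunWithRetry(cluster, {tid: 9, iteration: 14, numUpdated: 2});
console.log(cluster.log.length);  // 2: the same update was committed twice
```

In this model the same update appears twice in the log, which matches the repeated entry for tid 9, iteration 14 in the error above.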

Proposed fix

The fix would be to make withTxnAndAutoRetry retry just the commit if it fails, similar to what the drivers spec requires, namely:

commitTransaction is a retryable write command. Drivers MUST retry once after commitTransaction fails with a retryable error according to the Retryable Writes Specification, regardless of whether retryWrites is set on the MongoClient or not.
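
A minimal sketch of the idea, not the code that was committed for this ticket. The helper names (isRetryableCommitError, commitWithRetry, runTxnRetryingOnlyCommit) are hypothetical, the session object is assumed to expose startTransaction(), commitTransaction() and abortTransaction() and to throw on a failed commit, and the set of errors treated as retryable here (the UnknownTransactionCommitResult label and the Interrupted code) is an assumption:

```javascript
// Hypothetical helpers; not the actual contents of auto_retry_transaction.js.

// Assumption: these errors mean "the commit outcome is unknown", so it is
// safe to resend commitTransaction but not to re-run the whole transaction.
function isRetryableCommitError(e) {
    const labels = e.errorLabels || [];
    return labels.includes("UnknownTransactionCommitResult") ||
        e.code === 11601;  // Interrupted
}

// Retry only the commit. Returns the number of retries that were needed.
function commitWithRetry(session, maxCommitRetries = 3) {
    for (let attempt = 0;; attempt++) {
        try {
            session.commitTransaction();
            return attempt;
        } catch (e) {
            if (!isRetryableCommitError(e) || attempt >= maxCommitRetries) {
                throw e;
            }
            // Otherwise loop and send commitTransaction again.
        }
    }
}

// Run the transaction body once, then commit with commit-only retries.
function runTxnRetryingOnlyCommit(session, body) {
    session.startTransaction();
    try {
        body();
    } catch (e) {
        session.abortTransaction();
        throw e;
    }
    return commitWithRetry(session);
}
```

With this shape, an interrupted commit leads to a second commitTransaction for the same transaction rather than a second execution of the body, so the updates are applied at most once.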



 Comments   
Comment by Githook User [ 25/Apr/19 ]

Author: Jack Mulrow (jsmulrow) <jack.mulrow@mongodb.com>

Message: SERVER-40186 Retry interrupted commits in auto_retry_transaction.js
Branch: master
https://github.com/mongodb/mongo/commit/4b7ecadd7d1661f7d347d9bf990709b26cb539c4

Comment by Janna Golden [ 18/Mar/19 ]

Yeah, I had run into this previously. I filed SERVER-37323, which was closed, and SERVER-37746, which we could turn into a more general ticket about making the shell compliant with the drivers spec.

Regardless, the fix sounds correct to me.

Comment by Max Hirschhorn [ 18/Mar/19 ]

I believe janna.golden had encountered a similar issue with the withTxnAndAutoRetry() function at one point, which is what prompted me to mention it to you as a possible cause last Friday.

Retrying the commitTransaction sounds correct to me based on what the drivers specification says. The commitTransaction() function in the user-facing version of the mongo shell doesn't have any retry logic. CC judah.schvimer, who has been working on how the version of the mongo shell used for testing retries commands in the face of network and other retryable error codes. I don't believe there's an existing SERVER ticket that tracks how the mongo shell isn't compliant with the drivers specification for transactions.

Comment by Kaloian Manassiev [ 18/Mar/19 ]

max.hirschhorn, do you mind confirming whether the diagnosis and the proposed fix above sound right?
