[SERVER-39082] Retry on TransientTransaction errors in targeted transaction jstests Created: 18/Jan/19  Updated: 31/Mar/20  Resolved: 31/Mar/20

Status: Closed
Project: Core Server
Component/s: Sharding, Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jack Mulrow Assignee: Haley Connelly
Resolution: Won't Do Votes: 0
Labels: sharding-4.4-stabilization, sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-41338 Add LockTimeout as allowed error for ... Closed
Related
related to SERVER-46510 Concurrent drop-pending notifications... Closed
is related to SERVER-36310 Unblacklist transactions from all jsc... Closed
Operating System: ALL
Sprint: Sharding 2019-02-25, Sharding 2019-03-11, Sharding 2019-03-25, Sharding 2019-04-08, Sharding 2019-04-22, Sharding 2019-05-06, Sharding 2020-04-06
Participants:
Linked BF Score: 10

 Description   

If a jstest that manually runs transaction statements fails with a TransientTransaction error, it is safe to retry the whole transaction, but it currently has no way to retry the previous statements, so the test will fail unnecessarily. The suites that run core/txns tests against sharded clusters are particularly susceptible to this, because of the chance of snapshot errors on slow hosts.

Each targeted test / suite could use an override file (similar to the existing txn_override.js) that batches transaction operations and supports advancing a session's txnNumber transparently to do this. The override wouldn't have to handle network errors or failovers, just transient transaction errors.



 Comments   
Comment by Haley Connelly [ 31/Mar/20 ]

There are no failures on the 4.2, 4.4, or master branches that we've seen recently. judah.schvimer talked about enableMajorityReadConcern=false in SERVER-46510, but even with --majorityReadConcern=off to resmoke we didn't see failures on the latest master with the transaction_error_handling.js. There are still failures on the 4.0 branch. It is still the scenario which Judah described in SERVER-46510. Closing this as won't do. 

Comment by Haley Connelly [ 30/Mar/20 ]

On a 4.0 spawned host, the error can be reproduced by setting maxTransactionLockRequestTimeoutMillis=1 and running resmoke with --suites=logical_session_cache_replication_100ms_refresh_jscore_passthrough jstests/core/txns/no_implicit_collection_creation_in_txn.js  on repeat (I did it 100 times).

Comment by Max Hirschhorn [ 14/Feb/19 ]

My only problem with this approach is that it will still be possible for a targeted transaction test to fail because of an unexpected SnapshotTooOld on a slow host, but I can't actually find a failure because of that, so I'd be fine addressing that later if it turns out to be a problem.

If the test fails due to an SnapshotTooOld error response, then it might be (a) the server has a bug where it isn't preserving enough history, or (b) the test is race-prone and should use the WTPreserveSnapshotHistoryIndefinitely failpoint to preserve more history. Either way, it sounds reasonable to me to address that separately if it ever comes up.

Comment by Jack Mulrow [ 14/Feb/19 ]

That's a good point about raising the max lock timeout. From looking through the linked BFs and their duplicate BFGs, it looks like every one failed trying to take a lock during commit or prepare and it should be fine to raise the limit for the core/txns tests since I don't think their transactions rely on the timeout (except the ones explicitly testing it, where we can use the original value).

My only problem with this approach is that it will still be possible for a targeted transaction test to fail because of an unexpected SnapshotTooOld on a slow host, but I can't actually find a failure because of that, so I'd be fine addressing that later if it turns out to be a problem.

Comment by Max Hirschhorn [ 13/Feb/19 ]

In particular, since we don't expect these errors to happen often, I'm wondering if the test infrastructure can somehow retry an entire jstest when it fails with a transient transaction error instead of making the tests retry internally. Max Hirschhorn, what do you think?

jack.mulrow, I'd be worried that it may mean if a test does enough multi-statement transactions, then the likelihood of one of them failing with a TransientTransactionError is almost sure, and would therefore cause the entire test to be rerun continuously. Retrying just the multi-statement transaction would increase the likelihood of making forward progress in the test.

My understanding is that the tests in the jstests/core/txns/ directory weren't written with the expectation that they'd ever encounter transient transactions errors. I think trying to do implicit retries is going to be difficult for some of the reasons you mentioned around mutable state (e.g. the cursor id). Is it possible to change the server's configuration in the sharded_*_txns*.yml test suites by raising the maxTransactionLockRequestTimeoutMillis server parameter to avoid encountering transient transaction errors? For example, the testing infrastructure sets the transactionLifetimeLimitSeconds server parameter to 3 hours to avoid transaction being killed due to the Evergreen machine being slow.

Comment by Jack Mulrow [ 11/Feb/19 ]

I've spent a little while working on this ticket, and I'm realizing there are a few problems with the override in the description. The main ones are:

  1. Some tests in the linked BFs run concurrent transactions on different sessions, so to cover these cases the override needs to store operations from the latest transaction per session and the global order the test ran them in, which raises questions about how to handle a transient failure in one transaction, but not the others. I think we can handle this by aborting every transaction then retrying every statement in the original order, but I can see writing this taking a while.
  2. At least one failing test uses the result from one transaction statement to construct a later statement in the same transaction, so retrying the statements as they were received won't work if it's possible for the result from the retry to be different, like this statement in kill_cursors_in_transaction.js that kills a cursor established earlier in the transaction (the cursor established by the retry will have a new cursor id).

To resolve the linked BFs, we could write an override without handling these cases and manually add retries to tests that aren't covered, but that makes me think we should try to find a more generic fix instead. In particular, since we don't expect these errors to happen often, I'm wondering if the test infrastructure can somehow retry an entire jstest when it fails with a transient transaction error instead of making the tests retry internally. max.hirschhorn, what do you think?

CC judah.schvimer

Generated at Thu Feb 08 04:50:58 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.