[SERVER-39082] Retry on TransientTransaction errors in targeted transaction jstests Created: 18/Jan/19 Updated: 31/Mar/20 Resolved: 31/Mar/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding, Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jack Mulrow | Assignee: | Haley Connelly |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | sharding-4.4-stabilization, sharding-wfbf-day | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||
| Sprint: | Sharding 2019-02-25, Sharding 2019-03-11, Sharding 2019-03-25, Sharding 2019-04-08, Sharding 2019-04-22, Sharding 2019-05-06, Sharding 2020-04-06 | ||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||
| Linked BF Score: | 10 | ||||||||||||||||||||||||
| Description |
|
If a jstest that manually runs transaction statements fails with a TransientTransaction error, it is safe to retry the whole transaction, but it currently has no way to retry the previous statements, so the test will fail unnecessarily. The suites that run core/txns tests against sharded clusters are particularly susceptible to this, because of the chance of snapshot errors on slow hosts. Each targeted test / suite could use an override file (similar to the existing txn_override.js) that batches transaction operations and supports advancing a session's txnNumber transparently to do this. The override wouldn't have to handle network errors or failovers, just transient transaction errors. |
| Comments |
| Comment by Haley Connelly [ 31/Mar/20 ] |
|
There are no failures on the 4.2, 4.4, or master branches that we've seen recently. judah.schvimer talked about enableMajorityReadConcern=false in |
| Comment by Haley Connelly [ 30/Mar/20 ] |
|
On a 4.0 spawned host, the error can be reproduced by setting maxTransactionLockRequestTimeoutMillis=1 and running resmoke with --suites=logical_session_cache_replication_100ms_refresh_jscore_passthrough jstests/core/txns/no_implicit_collection_creation_in_txn.js on repeat (I did it 100 times). |
| Comment by Max Hirschhorn [ 14/Feb/19 ] |
If the test fails due to an SnapshotTooOld error response, then it might be (a) the server has a bug where it isn't preserving enough history, or (b) the test is race-prone and should use the WTPreserveSnapshotHistoryIndefinitely failpoint to preserve more history. Either way, it sounds reasonable to me to address that separately if it ever comes up. |
| Comment by Jack Mulrow [ 14/Feb/19 ] |
|
That's a good point about raising the max lock timeout. From looking through the linked BFs and their duplicate BFGs, it looks like every one failed trying to take a lock during commit or prepare and it should be fine to raise the limit for the core/txns tests since I don't think their transactions rely on the timeout (except the ones explicitly testing it, where we can use the original value). My only problem with this approach is that it will still be possible for a targeted transaction test to fail because of an unexpected SnapshotTooOld on a slow host, but I can't actually find a failure because of that, so I'd be fine addressing that later if it turns out to be a problem. |
| Comment by Max Hirschhorn [ 13/Feb/19 ] |
jack.mulrow, I'd be worried that it may mean if a test does enough multi-statement transactions, then the likelihood of one of them failing with a TransientTransactionError is almost sure, and would therefore cause the entire test to be rerun continuously. Retrying just the multi-statement transaction would increase the likelihood of making forward progress in the test. My understanding is that the tests in the jstests/core/txns/ directory weren't written with the expectation that they'd ever encounter transient transactions errors. I think trying to do implicit retries is going to be difficult for some of the reasons you mentioned around mutable state (e.g. the cursor id). Is it possible to change the server's configuration in the sharded_*_txns*.yml test suites by raising the maxTransactionLockRequestTimeoutMillis server parameter to avoid encountering transient transaction errors? For example, the testing infrastructure sets the transactionLifetimeLimitSeconds server parameter to 3 hours to avoid transaction being killed due to the Evergreen machine being slow. |
| Comment by Jack Mulrow [ 11/Feb/19 ] |
|
I've spent a little while working on this ticket, and I'm realizing there are a few problems with the override in the description. The main ones are:
To resolve the linked BFs, we could write an override without handling these cases and manually add retries to tests that aren't covered, but that makes me think we should try to find a more generic fix instead. In particular, since we don't expect these errors to happen often, I'm wondering if the test infrastructure can somehow retry an entire jstest when it fails with a transient transaction error instead of making the tests retry internally. max.hirschhorn, what do you think? |