[SERVER-59880] Don't mark transaction timeout errors as retryable
Created: 10/Sep/21
Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: A. Jesse Jiryu Davis
Assignee: Backlog - Replication Team
Resolution: Unresolved
Votes: 0
Labels: former-quick-wins
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links: Related
Assigned Teams: Replication
Participants:

 Description   

When a transaction exceeds transactionLifetimeLimitSeconds, we currently attach the TransientTransactionError label to the resulting error. This label instructs the client to retry the whole transaction, but the retry will probably fail again for the same reason. Modern drivers that implement timeoutMS may retry forever. Let's omit the TransientTransactionError label from transaction timeout errors.
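
To illustrate the concern, here is a minimal sketch of a label-driven retry loop. It is not any driver's actual implementation: ServerError, run_with_retry, count_attempts, and the max_attempts cap are hypothetical names introduced only for this example, and the cap exists solely so the demonstration terminates.

```python
TRANSIENT = "TransientTransactionError"


class ServerError(Exception):
    """Hypothetical stand-in for a driver exception carrying server error labels."""

    def __init__(self, message, labels=()):
        super().__init__(message)
        self.labels = frozenset(labels)

    def has_error_label(self, label):
        return label in self.labels


def run_with_retry(txn_body, max_attempts=None):
    """Re-run txn_body for as long as it fails with the transient label.

    max_attempts=None models a client with no timeout: a body that always
    fails with the label would loop forever.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return txn_body()
        except ServerError as exc:
            if not exc.has_error_label(TRANSIENT):
                raise  # not labelled retryable: surface the error at once
            if max_attempts is not None and attempts >= max_attempts:
                raise  # demo-only cap so the example terminates


def count_attempts(labels, max_attempts=5):
    """Run a transaction that always exceeds the lifetime limit; count the tries."""
    calls = []

    def body():
        calls.append(1)
        raise ServerError("transaction exceeded its lifetime limit", labels)

    try:
        run_with_retry(body, max_attempts=max_attempts)
    except ServerError:
        pass
    return len(calls)
```

Under these assumptions, count_attempts([TRANSIENT]) uses all 5 permitted attempts, while count_attempts([]) stops after 1. Without the cap, the labelled case would never return, which is the behavior this ticket wants to avoid.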



 Comments   
Comment by A. Jesse Jiryu Davis [ 15/Sep/21 ]

The driver timeout is infinite by default, which makes sense because the driver retries only "retryable" errors: those we judge likely to succeed after a few retries. I believe (without evidence) that a transaction which runs longer than transactionLifetimeLimitSeconds on its first try will probably do so on every try, so it shouldn't be labeled retryable.

Other things we label retryable have temporary causes: failover, write conflict, and so on. The cause of a too-long transaction is often permanent: the client is trying to do too much work in one transaction.

Comment by Judah Schvimer [ 15/Sep/21 ]

There are two timeouts at play here: the driver timeout and the server timeout. The driver timeout is set by an application developer saying "I want this request to try for this long". The server timeout is set by a database operator saying "we need to kill transactions that take longer than X time to fairly share and appropriately utilize resources". From my perspective, we should be using the driver's timeout here and retrying on server-side timeout errors, in case the timeout doesn't happen again. I think there are many cases, especially around lock contention, where a retry would succeed even if the original operation timed out. That said, that would probably be a stronger argument for getting rid of the server-side timeout, and retrying work the server just threw away of its own accord does feel wasteful.
