[SERVER-34256] On error within a transaction, expose in response whether transaction can be safely retried from the beginning Created: 02/Apr/18  Updated: 29/Oct/23  Resolved: 08/May/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.0.0-rc0

Type: Task Priority: Major - P3
Reporter: Spencer Brody (Inactive) Assignee: Siyuan Zhou
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-36580 Mongos should preserve error labels f... Closed
related to SERVER-34583 Clean up transaction error codes Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2018-04-23, Repl 2018-05-07, Repl 2018-05-21
Participants:

 Comments   
Comment by Githook User [ 08/May/18 ]

Author:

{'email': 'siyuan.zhou@mongodb.com', 'name': 'Siyuan Zhou', 'username': 'visualzhou'}

Message: SERVER-34256 Expose TransientTxnError error labels in response.
Branch: master
https://github.com/mongodb/mongo/commit/e6c9fa2388b3ee6f789b14c5f767ea4c01834090

Comment by A. Jesse Jiryu Davis [ 04/May/18 ]

Drivers generally have a root exception class, or something like that, which will I hope allow them all to add an errorLabel API to all exceptions they throw. We have some special exception classes for some exceptions (e.g. PyMongo has a DuplicateKeyError which inherits from OperationFailure), but that's ok. We'll add error labels to the root class.

Note, error labels don't change our exception class hierarchies. We won't throw an exception of a different class in response to receiving a server error label. We'll throw the same class of exception we did before, now with one or more labels added.

Comment by Spencer Brody (Inactive) [ 03/May/18 ]

Ah I understand.  We have special objects in the shell for representing various types of error returns from operations, and we needed to figure out which ones of those should expose the errorLabels.  I believe drivers expose all types of errors in the same way (as exceptions in the languages that support them, not sure about the others), but it would be good to confirm.  jesse

Comment by Siyuan Zhou [ 03/May/18 ]

The shell does not return the raw response for write commands, because it has the logic to merge responses from several requests of a single bulk execution. The shell has WriteCommandError, WriteResult, BulkWriteResult, BulkWriteError, WriteError and WriteConcernError classes to represent different types of results of a bulk execution. I imagine drivers have something similar and need to expose the new field errorLables somehow. In the shell, WriteCommandError will be returned (for insert()) or thrown (for insertOne() and insertMany()) for all TransientTxnErrors, so I exposed the errorLabels there in the patch set I'm working on.

Comment by Spencer Brody (Inactive) [ 02/May/18 ]

I don't follow - what do you mean by "wrapping the responses"?  What are you suggesting that the drivers need to do?

Comment by Siyuan Zhou [ 02/May/18 ]

max.hirschhorn told me that we are wrapping the responses for write commands in the shell. I'm not sure if that's true in drivers. If so, we need to update either the driver spec or the error handling spec to update the bulk api response format.

Comment by A. Jesse Jiryu Davis [ 30/Apr/18 ]

Good, that sounds like it's a simple server behavior for drivers to handle.

Comment by Siyuan Zhou [ 30/Apr/18 ]

It turns out write commands won't report transaction transient errors in the embedded writeErrors array, because any error on any write operation in a transaction will fail the whole batch by throwing the error.

Comment by Tess Avitabile (Inactive) [ 24/Apr/18 ]

We made a decision in the transactions leads meeting today that NoSuchTransaction should get the error label TransientTxnError. The rationale is that if we don't know why a transaction was aborted, so we don't know whether it is retryable, we should report that it is retryable be default. This ensures that if a transaction was aborted due to failover, it can be retried. If the transaction was in fact not retryable (but the client missed the response and did not receive the error), then the client will get the correct (non-retryable) error when they attempt to retry. This will make transactions more usable at the expense of sometimes retrying non-retryable transactions.

The one case of concern is when a transaction was aborted by killSessions. The application can get a NoSuchTransaction error, which is considered retryable, and attempt to retry the transaction, so the operator may need to kill the transaction again. We will consider whether we should do something for 4.0 to record the kill reason in the case of killSessions, so that we do not return a retryable error. But for the sake of this ticket, we will consider NoSuchTransaction a retryable error.

Comment by Siyuan Zhou [ 23/Apr/18 ]

At the end of the meeting on 4/19, we decided to have drivers to track failover and mark those errors that could happen both on the old and new primary during a failover as TransientTxnError. Just want to confirm that with driver team. shane.harvey and jesse, we may need to update the driver spec accordingly.

Comment by Shane Harvey [ 20/Apr/18 ]

Depending on the timing, I believe the running client can also get TransactionAborted (will be changed to NoSuchTransaction in SERVER-34583) because the primary kills transaction on stepdown. That's also not transient.

It's also possible to get a writeConcernError on the commit, such as:

{
  'ok': 1.0,
  'writeConcernError': {'code': 91, 'codeName': 'ShutdownInProgress', 'errmsg': 'Replication is being shut down'}, 
}

In this case, driver's would need to re-run the commitTransaction command to find out if the transaction reached the new primary.

Comment by Siyuan Zhou [ 19/Apr/18 ]

To summarize the discussion in the meeting, besides NotMaster and Shutdown error classes, we only mark WriteConflict and SnapshotUnavailable as transient errors. When a failover happens, the driver may get one of the NotMaster errors, e.g. PrimarySteppedDown or InterruptedDueToReplStateChange, or the driver may get a network error since the connection gets closed. NotMaster errors will be marked as transient by the server, and network errors (except that from commitTransaction) will be marked as transient by the driver.

Another possible scenario is that between the operations in a transaction, the driver notices the failover and changes its belief of the primary, then the following transaction operation will be sent to the new primary and get NoSuchTransaction, which is not transient.

Depending on the timing, I believe the running client can also get TransactionAborted (will be changed to NoSuchTransaction in SERVER-34583) because the primary kills transaction on stepdown. That's also not transient.

Comment by Siyuan Zhou [ 19/Apr/18 ]

ErrorCodes::ConflictingOperationInProgress is used in 2 places for transaction:

  1. Restarting a running transaction
  2. When a running transaction sees a unexpected higher txnNumber, due to migration.

ErrorCodes::TransactionAborted used for where the transaction is aborted unexpectedly. We don't distinguish the reasons.

ErrorCodes::TransactionTooOld is returned in beginTxn() when the txnNumber is less than the active one.

Comment by Tess Avitabile (Inactive) [ 19/Apr/18 ]

Yes, I agree that should be retryable.

Comment by Shane Harvey [ 19/Apr/18 ]

I think SnapshotUnavailable should be retryable as well: "Command failed with error 246: 'Unable to read from a snapshot due to pending collection catalog changes; please retry the operation."

Comment by Shane Harvey [ 18/Apr/18 ]

I think NoSuchTransaction will be hit frequently and should be considered retryable as well. It's my understanding that NoSuchTransaction could happen when running a command where:

  1. A previous command in the transaction failed and aborted the transaction (and the application ignored the error and ran another command).
  2. The server decided to abort the transaction itself.
  3. An election occurred and we're sending a command to a new primary that does not know about this transaction.

For 1 maybe we can tell users not to recover from transaction errors within the transaction but 2 and 3 seem like retryable errors to me.

Comment by Spencer Brody (Inactive) [ 18/Apr/18 ]

Yeah, I think killSessions shouldn't cause TransactionAborted. It should result in NoSuchTransaction, or maybe Interrupted? The reaper ideally would result in a code that indicated the transaction took too long, either ExceededTimeLimit or a new code. I think it's fine to just have both return NoSuchTransaction for 4.0, however. In 4.2 we plan on being smarter about the error codes we return in general. TransactionAborted is also likely to make a return in 4.2, but it's fine if it's unused in 4.0.

Comment by Tess Avitabile (Inactive) [ 18/Apr/18 ]

Currently, if there is an in-progress operation on the transaction, and the transaction is killed by killSessions or the reaper, the error returned will be TransactionAborted. If the transaction is killed by killSessions or the reaper while inactive, then the next attempted operation will return error NoSuchTransaction. Would it be preferable to make these both return NoSuchTransaction, and do away will TransactionAborted entirely?

It sounds like the only "transaction" error that should be considered retryable is WriteConflict.

Comment by A. Jesse Jiryu Davis [ 18/Apr/18 ]

Agreed that killSessions should cause TransactionNotFound, ideally. Would the driver that is using the transaction see that error, whether or not a transaction has an operation in progress when its session is killed?

Also agreed that TransactionNotFound is a permanent error, not a transient one.

Comment by Tess Avitabile (Inactive) [ 18/Apr/18 ]

I don't think we need to include ConflictingOperationInProgress, since there will not be chunk migrations concurrent with transactions in 4.0. In 4.2, hitting ConflictingOperationInProgress in a transaction would indicate a user/driver error, since a newer txnNumber was used on one shard while there was still an open transaction with a lower txnNumber on a different shard.

For NoSuchTransaction, I am not certain whether we want to say it is safe to retry, since there are several reasons why a transaction could have been aborted. For example, the user aborted it, it was aborted by killSessions, it was aborted for running too long. The drivers' scope doc says that if a transaction was killed by the DBA, it should not be considered safe to retry.

We only expect to get TransactionAborted if the transaction is aborted in the middle of a user operation on the transaction, i.e. the operation has the session checked out, but either killSessions or the reaper aborts the transaction. I think TransactionAborted should be treated the same as NoSuchTransaction.

I agree that TransactionTooOld should only happen in the case of user error.

Comment by Siyuan Zhou [ 18/Apr/18 ]

NoSuchTransaction generally means the expected transaction doesn't exist. It's used for
1) The first operation fails after beginTxn() before unstashTxn(), the next one will abort the transaction and return NoSuchTransaction. The case in SERVER-34219.
2) Start a transaction with higher txnNumber but without startTransaction field.
3) Given transaction number is already aborted.

TransactionTooOld can happen after a chunk migration.

git grep "ErrorCodes" – src/mongo/db/session.cpp gives all occurrences of error codes.

Comment by Spencer Brody (Inactive) [ 18/Apr/18 ]

Hmm, I don't think NoSuchTransaction should be included on this list, since we don't know anything about the transaction at that point. I'm also not sure about TransactionTooOld. That should probably only happen in the case of an application or driver bug, right?

I also think it might be a problem if killSession results in a transaction returning TransactionAborted. When would that happen? I'd expect that if a user tried to use a transaction that was killed by killSessions, that they'd get NoSuchTransaction, not TransactionAborted. My understanding at this point was that the only way you get TransactionAborted is if you re-use a transaction after explicitly calling abortTransaction.

CC jesse tess.avitabile shane.harvey

Comment by Siyuan Zhou [ 18/Apr/18 ]

Here's the list of transient transaction errors. TransactionAborted is interesting because the transaction may be aborted by killSession which implies the user doesn't want to retry the transaction.

// Transaction errors
ErrorCodes::ConflictingOperationInProgress  // e.g. chunk migration increases the txn number
ErrorCodes::NoSuchTransaction  // transaction has been aborted and not in the expected state
ErrorCodes::TransactionAborted  // transaction is aborted while the transaction is running
ErrorCodes::TransactionTooOld   // txnNumber < activeTxnNumber
ErrorCodes::WriteConflict  // classic concurrent write conflict.
 
// Topology change
ErrorCodes::NotMaster
ErrorCodes::NotMasterNoSlaveOk
ErrorCodes::NotMasterOrSecondary
ErrorCodes::PrimarySteppedDown
ErrorCodes::InterruptedDueToReplStateChange
 
// Shutdown
error_class("ShutdownError", ["ShutdownInProgress", "InterruptedAtShutdown"])

Comment by Spencer Brody (Inactive) [ 17/Apr/18 ]

For now, just basing it on code is sufficient. In the future we may get more context-aware, but we don't need to do anything for that now.

We should only be attaching the TransientTxnError label for failed transactions, however, not for all commands.

Comment by Siyuan Zhou [ 11/Apr/18 ]

spencer, is it possible for a certain error to sometimes belong to a given error class, but sometimes not? I'm trying to reason about whether we should categorize error codes based on the code themselves (in common code path) or based on the context (in transaction code path).

Generated at Thu Feb 08 04:36:02 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.