[SERVER-40493] Make Interrupted a retryable writes error Created: 05/Apr/19  Updated: 29/Oct/23  Resolved: 14/May/19

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: None
Fix Version/s: 4.1.12

Type: Bug Priority: Major - P3
Reporter: Judah Schvimer Assignee: Randolph Tan
Resolution: Fixed Votes: 0
Labels: todo_in_code
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by DRIVERS-651 Make ExceededTimeLimit retryable writ... Closed
Related
related to SERVER-39890 Make network_error_and_txn_override.j... Closed
related to SERVER-44519 Label ExceededTimeLimit (262) errors ... Closed
related to SERVER-43480 Remove TODO for SERVER-40493 Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2019-05-20
Participants:
Linked BF Score: 14

 Description   

Interrupted is not in the retryable writes spec, while InterruptedAtShutdown and InterruptedDueToStepDown are. There are other related codes that we also should consider adding.

This would include changing the drivers spec, the shell code, and the RetryableWritesUtil.



 Comments   
Comment by Githook User [ 02/Dec/19 ]

Author:

{'name': 'Randolph Tan', 'email': 'randolph@mongodb.com'}

Message: SERVER-43480 Remove TODO for SERVER-40493
Branch: master
https://github.com/mongodb/mongo/commit/7c23ee69ad45c20153457278bbae859513d9fa5e

Comment by Shane Harvey [ 14/May/19 ]

I think this ticket still has some open discussion about making more errors retryable in Drivers. Do we still want drivers to retry writes that fail with ExceededTimeLimit, LockTimeout, and ClientDisconnect?

I would prefer a more general solution to this problem where the server returns an errorLabel indicating that the write is safe to retry, e.g. {ok:0, code: 24, codeName: "LockTimeout", errorLabels: ["RetryableWriteError"]}. Is there a ticket for using errorLabels in retryable write responses this already?

Comment by Githook User [ 14/May/19 ]

Author:

{'email': 'randolph@10gen.com', 'name': 'Randolph Tan', 'username': 'renctan'}

Message: SERVER-40493 Use retryable interrupt code in ShardServerCatalogCacheLoader
Branch: master
https://github.com/mongodb/mongo/commit/1ca8e6db734a51a453e50768f88b81a98615ae0b

Comment by Randolph Tan [ 13/May/19 ]

The driver spec provides the correct user behavior,

So far driver specs make sense to me.

The Transaction Coordinator provides the correct behavior when retrying commitTransaction, prepareTransaction, and abortTransaction,

We currently have targeted failover tests for this.

We do not get build failures,

Filed SERVER-41120, although it's currently masked by it not getting killed because of this check.

Our test suite retry logic matches the driver spec as similarly as possible.

This should be covered by SERVER-39890.

I looked all Interrupted code usage and the only place I think needs to change is ShardServerCatalogCacheLoader, so I'm just going to use this ticket to fix it.

Comment by Randolph Tan [ 09/May/19 ]

CursorKilled, is this error even possible as part of a retryable write?

I don't think so. As far as I know none of the write ops would register a cursor to the cursor manager. This applies more to queries/agg.

LockTimeout, I agree that we should retry on this code but again is this error even possible as part of a retryable write?

Yes. You can get this error from trying to acquire collection/db locks and writes do go through the path.

ClientDisconnect, what does this error mean and when would it be returned to a driver?

It means that the remote connection was severed. This makes more sense when there are multiple network hops, for example, driver sends write to mongos, then mongos sending write commands to shards. So, it is possible to get this error during retryable write.

Comment by Judah Schvimer [ 08/Apr/19 ]

We need to make sure that:

  1. The driver spec provides the correct user behavior,
  2. The Transaction Coordinator provides the correct behavior when retrying commitTransaction, prepareTransaction, and abortTransaction,
  3. We do not get build failures,
  4. Our test suite retry logic matches the driver spec as similarly as possible.
Comment by Judah Schvimer [ 05/Apr/19 ]

shane.harvey thank you for the detailed response!

Are there any other situations where Interrupted is returned and a retry would be legitimate?

Right now sharding returns Interrupted in some places where we maybe would want to retry. The shell returns Interrupted if it gets interrupted in server-side javascript (that's where this arose from). It's possible stepdown also does, we'd have to do further investigation. I think it would be reasonable to change these to return a different response, or this can just be a place where our tests must differ from the driver spec.

CursorKilled, is this error even possible as part of a retryable write?
LockTimeout, I agree that we should retry on this code but again is this error even possible as part of a retryable write?
ClientDisconnect, what does this error mean and when would it be returned to a driver?

This requires further investigation.

Comment by Tess Avitabile (Inactive) [ 05/Apr/19 ]

We chose not to retry after Interrupted because that error code is returned when an admin issues a killOp. Retrying after this error would then override the admin's decision to kill the operation.

Yes, now I remember! Thank you for reminding us.

Comment by Shane Harvey [ 05/Apr/19 ]

We chose not to retry after Interrupted because that error code is returned when an admin issues a killOp. Retrying after this error would then override the admin's decision to kill the operation.

Are there any other situations where Interrupted is returned and a retry would be legitimate?

As for the other error codes in the "Interruption" class:

error_class("Interruption", ["Interrupted",
                             "InterruptedAtShutdown",
                             "InterruptedDueToStepDown",
                             "ExceededTimeLimit",
                             "MaxTimeMSExpired",
                             "CursorKilled",
                             "LockTimeout",
                             "ClientDisconnect"])

  1. InterruptedAtShutdown and InterruptedDueToStepDown are already retried.
  2. ExceededTimeLimit, I agree that we should retry on this code. For some background see SERVER-35031.
  3. MaxTimeMSExpired, we do not retry this error because doing so would negate the purpose of maxTimeMS; to interrupt an operation after X milliseconds and yield control back to the application.
  4. CursorKilled, is this error even possible as part of a retryable write?
  5. LockTimeout, I agree that we should retry on this code but again is this error even possible as part of a retryable write?
  6. ClientDisconnect, what does this error mean and when would it be returned to a driver?
Generated at Thu Feb 08 04:55:09 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.