[SERVER-40493] Make Interrupted a retryable writes error Created: 05/Apr/19 Updated: 29/Oct/23 Resolved: 14/May/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.12 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Judah Schvimer | Assignee: | Randolph Tan |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | todo_in_code |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Sharding 2019-05-20 |
| Participants: | |
| Linked BF Score: | 14 |
| Description |
|
Interrupted is not in the retryable writes spec, while InterruptedAtShutdown and InterruptedDueToStepDown are. There are other related codes that we also should consider adding. This would include changing the drivers spec, the shell code, and the RetryableWritesUtil. |
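To make the gap concrete, here is a minimal sketch of a code-name based retryability check. It assumes the set holds only the two codes this description says are already in the spec; the real list is longer, and the names `RETRYABLE_CODE_NAMES` and `is_retryable_by_code_name` are hypothetical, not taken from the drivers spec, the shell, or RetryableWritesUtil.

```python
# Hypothetical sketch: only the two codes named in the description are listed.
# A real retryable-writes list contains more entries than this.
RETRYABLE_CODE_NAMES = {"InterruptedAtShutdown", "InterruptedDueToStepDown"}


def is_retryable_by_code_name(response: dict) -> bool:
    """Return True if a failed command response names a retryable code."""
    if response.get("ok") == 1:
        # Successful responses are never retried.
        return False
    return response.get("codeName") in RETRYABLE_CODE_NAMES
```

With this set, a response carrying `codeName: "Interrupted"` is not retried, which is the behavior this ticket proposes to change.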
| Comments |
| Comment by Githook User [ 02/Dec/19 ] |
|
Author: {'name': 'Randolph Tan', 'email': 'randolph@mongodb.com'} Message: |
| Comment by Shane Harvey [ 14/May/19 ] |
|
I think this ticket still has some open discussion about making more errors retryable in Drivers. Do we still want drivers to retry writes that fail with ExceededTimeLimit, LockTimeout, and ClientDisconnect? I would prefer a more general solution to this problem where the server returns an errorLabel indicating that the write is safe to retry, e.g. {ok:0, code: 24, codeName: "LockTimeout", errorLabels: ["RetryableWriteError"]}. Is there already a ticket for using errorLabels in retryable write responses? |
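A minimal sketch of how a client could act on such a label, assuming the response is available as a plain dictionary shaped like the example in this comment. The helper name `has_retryable_label` is hypothetical and is not a driver API.

```python
def has_retryable_label(response: dict) -> bool:
    """Return True if the server marked this response as safe to retry."""
    # errorLabels may be absent entirely, so default to an empty list.
    return "RetryableWriteError" in response.get("errorLabels", [])


# Response shaped like the example in the comment above.
lock_timeout = {
    "ok": 0,
    "code": 24,
    "codeName": "LockTimeout",
    "errorLabels": ["RetryableWriteError"],
}
should_retry = has_retryable_label(lock_timeout)
```

The point of this design is that the client no longer needs its own list of error codes: the server decides per response whether a retry is safe.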
| Comment by Githook User [ 14/May/19 ] |
|
Author: {'email': 'randolph@10gen.com', 'name': 'Randolph Tan', 'username': 'renctan'} Message: |
| Comment by Randolph Tan [ 13/May/19 ] |
So far driver specs make sense to me.
We currently have targeted failover tests for this.
Filed
This should be covered by
I looked at all Interrupted code usage, and the only place I think needs to change is ShardServerCatalogCacheLoader, so I'm just going to use this ticket to fix it. |
| Comment by Randolph Tan [ 09/May/19 ] |
I don't think so. As far as I know, none of the write ops would register a cursor with the cursor manager. This applies more to queries/agg.
Yes. You can get this error from trying to acquire collection/db locks, and writes do go through that path.
It means that the remote connection was severed. This makes more sense when there are multiple network hops, for example, when the driver sends a write to mongos and mongos then sends write commands to the shards. So it is possible to get this error during a retryable write. |
| Comment by Judah Schvimer [ 08/Apr/19 ] |
|
We need to make sure that:
|
| Comment by Judah Schvimer [ 05/Apr/19 ] |
|
shane.harvey thank you for the detailed response!
Right now sharding returns Interrupted in some places where we might want to retry. The shell returns Interrupted if it gets interrupted in server-side JavaScript (that's where this arose from). It's possible stepdown also does; we'd have to investigate further. I think it would be reasonable to change these to return a different response, or this can just be a place where our tests must differ from the driver spec.
This requires further investigation. |
| Comment by Tess Avitabile (Inactive) [ 05/Apr/19 ] |
Yes, now I remember! Thank you for reminding us. |
| Comment by Shane Harvey [ 05/Apr/19 ] |
|
We chose not to retry after Interrupted because that error code is returned when an admin issues a killOp. Retrying after this error would then override the admin's decision to kill the operation. Are there any other situations where Interrupted is returned and a retry would be legitimate? As for the other error codes in the "Interruption" class:
|