[SERVER-39625] Add ShutdownInProgress to the list of kIdempotent retryable errors Created: 15/Feb/19 Updated: 27/Oct/23 Resolved: 31/Mar/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Esha Maharishi (Inactive) | Assignee: | Amirsaman Memaripour |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | sharding-4.4-stabilization, sharding-wfbf-day | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL | ||||
| Linked BF Score: | 6 | ||||
| Description |
|
Add ShutdownInProgress to RemoteCommandRetryScheduler::kAllRetriableErrors. |
| Comments |
| Comment by Amirsaman Memaripour [ 31/Mar/20 ] |
|
This commit has already addressed this issue. |
| Comment by Kaloian Manassiev [ 15/May/19 ] |
|
This ticket raises a good point about cluster and replica-set resiliency when servers are shut down for maintenance, and I think Esha is right about adding ShutdownInProgress to the list of retryable errors. This needs to be coordinated with the drivers as well, and since the behavior has been this way forever, I am booting it out of the 4.2.0 blockers. |
| Comment by Esha Maharishi (Inactive) [ 21/Feb/19 ] |
|
kaloian.manassiev yeah, exactly: ReplicationCoordinatorImpl::_awaitReplication_inlock can return ShutdownInProgress here. And yes, in the case of the BF, a secondary is expected to step up. |
| Comment by Kaloian Manassiev [ 21/Feb/19 ] |
|
Is the issue that when a node in a replica set is shutting down, the caller can sometimes get ShutdownInProgress instead of InterruptedDueToStepDown, depending on what stage of execution it happened to be in? I believe the idea at the time was that if a node is shutting down, having the caller retry doesn't help, because there is no guarantee that the node will be restarted after the shutdown, and that's why we didn't add it. From reading the BF, it looks like the cause might be different? |
| Comment by Esha Maharishi (Inactive) [ 19/Feb/19 ] |
|
kaloian.manassiev hm, I'm not sure. InterruptedDueToStepDown is already in the list and should cause issues similar to those of ShutdownInProgress (for shard->shard communication, since routers never generate InterruptedDueToStepDown). |
| Comment by Kaloian Manassiev [ 15/Feb/19 ] |
|
This would not be easy to do, because we also rely on this code to discover local shutdowns, don't we? |
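
The change discussed in this ticket can be pictured as widening a retryable-error predicate. The following is only a minimal sketch under stated assumptions: the `ErrorCode` enum, the `BadValue` entry, and the functions `isRetryableBefore` and `isRetryableAfter` are hypothetical names invented for illustration, and are not the server's actual `ErrorCodes` or `RemoteCommandRetryScheduler` definitions. The only facts taken from the thread are that InterruptedDueToStepDown is already in the retryable list and that the proposal is to add ShutdownInProgress.

```cpp
// Hypothetical stand-in for the server's error codes (illustration only).
enum class ErrorCode {
    InterruptedDueToStepDown,  // per the thread, already treated as retryable
    ShutdownInProgress,        // the code this ticket proposes to add
    BadValue,                  // assumed example of a non-retryable error
};

// Hypothetical predicate modelling the retryable list before the change.
constexpr bool isRetryableBefore(ErrorCode code) {
    return code == ErrorCode::InterruptedDueToStepDown;
}

// Hypothetical predicate modelling the list after the proposed change:
// everything retryable before, plus ShutdownInProgress.
constexpr bool isRetryableAfter(ErrorCode code) {
    return isRetryableBefore(code) || code == ErrorCode::ShutdownInProgress;
}
```

With a predicate like this, a caller that receives ShutdownInProgress from a node that is going down would retry (for example, against a secondary that steps up) instead of failing the operation. The sketch does not address the concern raised above that the same code is also used to detect a local shutdown, where retrying would not help.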