[SERVER-39625] Add ShutdownInProgress to the list of kIdempotent retryable errors
Created: 15/Feb/19  Updated: 27/Oct/23  Resolved: 31/Mar/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Esha Maharishi (Inactive)
Assignee: Amirsaman Memaripour
Resolution: Gone away
Votes: 0
Labels: sharding-4.4-stabilization, sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Operating System: ALL
Participants:
Linked BF Score: 6

 Description   

Add it to RemoteCommandRetryScheduler::kAllRetriableErrors.
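
As a rough illustration of the request, the sketch below models a retriable-error list and a membership check. It is not the server's code: `ErrorCode`, `kRetriableErrorsSketch`, `isRetriableSketch`, and `SomeOtherError` are hypothetical names introduced here, and the actual contents of `kAllRetriableErrors` are not shown in this ticket. The only entries taken from the discussion are `InterruptedDueToStepDown` (stated in the comments to be in the list already) and `ShutdownInProgress` (the proposed addition).

```cpp
#include <algorithm>
#include <array>

// Hypothetical stand-in for the server's error codes. Only the two names
// discussed in this ticket are modeled; SomeOtherError is a placeholder
// for any code that should not be retried.
enum class ErrorCode {
    InterruptedDueToStepDown,
    ShutdownInProgress,
    SomeOtherError,
};

// Illustrative list only (assumption): the real kAllRetriableErrors
// contains other entries that are not reproduced here.
const std::array<ErrorCode, 2> kRetriableErrorsSketch = {
    ErrorCode::InterruptedDueToStepDown,  // already in the list, per the comments
    ErrorCode::ShutdownInProgress,        // the addition proposed by this ticket
};

// Returns true if a command that failed with `code` may be retried.
inline bool isRetriableSketch(ErrorCode code) {
    return std::find(kRetriableErrorsSketch.begin(),
                     kRetriableErrorsSketch.end(),
                     code) != kRetriableErrorsSketch.end();
}
```

With this shape, a caller that receives `ShutdownInProgress` from a node that is going down would take the same retry path as one that receives `InterruptedDueToStepDown`.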



 Comments   
Comment by Amirsaman Memaripour [ 31/Mar/20 ]

This commit has already addressed this issue.

Comment by Kaloian Manassiev [ 15/May/19 ]

This ticket raises a good point about cluster and replica-set resiliency when servers are shut down for maintenance, and I think Esha has a good point about adding ShutdownInProgress to the list of retryable errors. However, this needs to be coordinated with the drivers as well, and since the behavior has been this way forever, I am removing it from the 4.2.0 blockers.

Comment by Esha Maharishi (Inactive) [ 21/Feb/19 ]

kaloian.manassiev yeah exactly, ReplicationCoordinatorImpl::_awaitReplication_inlock can return ShutdownInProgress here.

Yeah in the case of the BF, a secondary is expected to step up.

Comment by Kaloian Manassiev [ 21/Feb/19 ]

Is the issue that when a node from a replica set is shutting down, it is sometimes possible that the caller gets ShutdownInProgress instead of InterruptedDueToStepDown, depending on what stage of the execution it happened to be?

I believe at the time the idea was that if a node is shutting down, having the caller retry doesn't help, because there is no guarantee that the node will be restarted after the shutdown and that's why we didn't add it. From reading the BF it looks like the cause might be different?

Comment by Esha Maharishi (Inactive) [ 19/Feb/19 ]

kaloian.manassiev hm, I'm not sure. InterruptedDueToStepDown is in the list and should cause similar issues to ShutdownInProgress... (for shard->shard communication, since routers never generate InterruptedDueToStepDown).

Comment by Kaloian Manassiev [ 15/Feb/19 ]

This would not be easy to do, because we rely on this code also to discover local shut-downs, don't we?
