- Type: Task
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Catalog and Routing
- 🟩 Routing and Topology
When a ReplicaSetMonitor (RSM) is dropped, StreamableReplicaSetMonitor::getHostOrRefresh() unconditionally returns ShutdownInProgress (error code 91) via makeReplicaSetMonitorRemovedError(). This was introduced by SERVER-51329, which changed the error from the non-retriable ReplicaSetMonitorRemoved (199) to the retriable ShutdownInProgress (91) so that external clients (drivers, mongos) would transparently retry when a mongos is shutting down.
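A self-contained sketch of that behavior follows. The Status type, enum values, and control flow here are simplified stand-ins inferred from this description, not the server's actual Status/SemiFuture machinery:

```cpp
#include <iostream>
#include <string>

enum class ErrorCode { OK = 0, ShutdownInProgress = 91, ReplicaSetMonitorRemoved = 199 };

struct Status {
    ErrorCode code;
    std::string reason;
};

// Stand-in for makeReplicaSetMonitorRemovedError(): since SERVER-51329 it
// yields the retriable ShutdownInProgress (91) no matter why the monitor
// was dropped.
Status makeReplicaSetMonitorRemovedError(const std::string& setName) {
    return {ErrorCode::ShutdownInProgress,
            "ReplicaSetMonitor for set " + setName + " is removed"};
}

struct StreamableReplicaSetMonitor {
    std::string setName;
    bool dropped = false;

    void drop() {  // no reason is recorded today
        dropped = true;
    }

    Status getHostOrRefresh() const {
        if (dropped) {
            // Unconditional: process shutdown and shard removal are
            // indistinguishable to the caller.
            return makeReplicaSetMonitorRemovedError(setName);
        }
        return {ErrorCode::OK, {}};
    }
};

int main() {
    StreamableReplicaSetMonitor rsm{"shard0-rs"};
    rsm.drop();  // could be shutdown *or* removeShard
    std::cout << "error code: " << static_cast<int>(rsm.getHostOrRefresh().code) << '\n';  // 91
}
```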
However, there are two fundamentally different scenarios that trigger RSM removal:
- Process shutdown — the mongos/mongod is shutting down, and all RSMs are dropped. In this case, ShutdownInProgress is the correct error: the client should retry against another mongos.
- Shard removal — a shard is removed from the cluster (e.g., via removeShard), and the ShardRegistry drops the RSM for the removed replica set. In this case, ShutdownInProgress is misleading: the process is still healthy, but the shard no longer exists.
Returning ShutdownInProgress for shard removal causes server-internal retry loops (such as withAutomaticRetry in resharding, oplog fetching, etc.) to treat it as a transient error and retry indefinitely against a shard that will never come back. While the immediate trigger for infinite retries in resharding was fixed by SERVER-123567 (which prevents the RSM from being poisoned in the first place), the underlying ambiguity of the error code remains a latent risk for any code path that holds a shared_ptr<Shard> with a dropped RSM.
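To make the retry-loop hazard concrete, here is an illustrative stand-in for a withAutomaticRetry-style helper. The classifier and loop are assumptions for demonstration, not the real resharding code, but they mirror the key fact that ShutdownInProgress is classified as retriable:

```cpp
#include <functional>
#include <iostream>

enum class ErrorCode { OK = 0, ShardNotFound = 70, ShutdownInProgress = 91 };

// The server treats ShutdownInProgress as a retriable error; this
// stand-in mimics that classification.
bool isRetriable(ErrorCode code) {
    return code == ErrorCode::ShutdownInProgress;
}

// Illustrative stand-in for a withAutomaticRetry-style helper: it keeps
// retrying as long as the error is classified as transient.
ErrorCode withAutomaticRetry(const std::function<ErrorCode()>& attempt) {
    while (true) {
        ErrorCode code = attempt();
        if (!isRetriable(code))
            return code;  // success or a non-retriable failure: stop here
        // A dropped-for-removal shard keeps yielding ShutdownInProgress,
        // so control never leaves this loop.
    }
}

int main() {
    std::cout << std::boolalpha
              << isRetriable(ErrorCode::ShutdownInProgress) << '\n'  // true: retried forever
              << isRetriable(ErrorCode::ShardNotFound) << '\n';      // false: fails fast
}
```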
Suggested approach:
Introduce a way to distinguish the two cases at the RSM level. For example:
- Add a DropReason enum (e.g., kShutdown, kShardRemoved) and pass it as a parameter to StreamableReplicaSetMonitor::drop().
- When the reason is kShardRemoved, return ReplicaSetMonitorRemoved (199) or ShardNotFound (70) — a non-retriable error that signals the shard is gone.
- When the reason is kShutdown, continue returning ShutdownInProgress (91) to preserve the SERVER-51329 behavior for external clients.
This would make internal retry loops correctly fail fast on shard removal while preserving transparent retry semantics for process shutdown.
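A minimal sketch of what this could look like, reusing the simplified stand-in types from above. The DropReason enum, the drop(reason) signature, and the error selection are this ticket's proposal, not existing server API:

```cpp
#include <iostream>
#include <string>

enum class ErrorCode { OK = 0, ShardNotFound = 70, ShutdownInProgress = 91, ReplicaSetMonitorRemoved = 199 };

struct Status {
    ErrorCode code;
    std::string reason;
};

struct StreamableReplicaSetMonitor {
    // Proposed: record *why* the monitor was dropped.
    enum class DropReason { kShutdown, kShardRemoved };

    std::string setName;
    bool dropped = false;
    DropReason dropReason = DropReason::kShutdown;

    // Proposed signature change: callers state the reason for the drop.
    void drop(DropReason reason) {
        dropped = true;
        dropReason = reason;
    }

    Status getHostOrRefresh() const {
        if (!dropped)
            return {ErrorCode::OK, {}};
        if (dropReason == DropReason::kShardRemoved) {
            // Non-retriable: the shard is gone for good, so fail fast.
            // ReplicaSetMonitorRemoved (199) would work here as well.
            return {ErrorCode::ShardNotFound,
                    "shard for set " + setName + " was removed"};
        }
        // Process shutdown: keep the retriable SERVER-51329 behavior so
        // external clients transparently retry against another mongos.
        return {ErrorCode::ShutdownInProgress,
                "ReplicaSetMonitor for set " + setName + " is removed"};
    }
};

int main() {
    StreamableReplicaSetMonitor rsm{"shard0-rs"};
    rsm.drop(StreamableReplicaSetMonitor::DropReason::kShardRemoved);
    std::cout << static_cast<int>(rsm.getHostOrRefresh().code) << '\n';  // 70 (ShardNotFound)
}
```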
- is related to SERVER-51329 Unexpected non-retryable error when shutting down a mongos server (Closed)
- is related to SERVER-123567 ShardRegistry should not remove RSM when a live shard still uses the same replica set name (Closed)