Type: Task
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None
Assigned Teams: Catalog and Routing
Linked BF Score: 200
CAR Domain/s: 🟩 Routing and Topology
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

When a ReplicaSetMonitor (RSM) is dropped, StreamableReplicaSetMonitor::getHostOrRefresh() unconditionally returns ShutdownInProgress (error code 91) via makeReplicaSetMonitorRemovedError(). This was introduced by SERVER-51329, which changed the error from the non-retriable ReplicaSetMonitorRemoved (199) to the retriable ShutdownInProgress (91) so that external clients (drivers, mongos) would transparently retry when a mongos is shutting down.

However, there are two fundamentally different scenarios that trigger RSM removal:

1. Process shutdown: the mongos/mongod is shutting down, and all RSMs are dropped. In this case, ShutdownInProgress is the correct error, because the client should retry against another mongos.

2. Shard removal: a shard is removed from the cluster (e.g., via removeShard), and the ShardRegistry drops the RSM for the removed replica set. In this case, ShutdownInProgress is misleading, because the process is still healthy but the shard no longer exists.

Returning ShutdownInProgress for shard removal causes server-internal retry loops (for example, withAutomaticRetry in resharding, or oplog fetching) to treat it as a transient error and retry indefinitely against a shard that will never come back. The immediate trigger for infinite retries in resharding was fixed by SERVER-123567, which prevents the RSM from being poisoned in the first place. The underlying ambiguity of the error code nevertheless remains a latent risk for any code path that holds a shared_ptr<Shard> with a dropped RSM.
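The failure mode can be illustrated with a minimal standalone sketch. It is not server code: isRetriable(), retryWhileRetriable(), RetryResult, and the maxAttempts bound are all hypothetical names introduced here, and the classification of 91 as retriable and 199 as non-retriable simply mirrors the description above. The bound exists only so the sketch terminates; the loops described in this ticket would keep going.

```cpp
#include <cassert>
#include <functional>

// Error codes named in this ticket, with the numeric values quoted above.
enum class ErrorCode {
    OK = 0,
    ShardNotFound = 70,
    ShutdownInProgress = 91,
    ReplicaSetMonitorRemoved = 199
};

// Hypothetical classification that mirrors the ticket text:
// 91 is treated as transient, everything else as fatal.
bool isRetriable(ErrorCode code) {
    return code == ErrorCode::ShutdownInProgress;
}

struct RetryResult {
    ErrorCode finalCode;
    int attempts;
};

// Hypothetical stand-in for an internal retry loop. maxAttempts is an
// assumption added so that this sketch terminates.
RetryResult retryWhileRetriable(const std::function<ErrorCode()>& op, int maxAttempts) {
    int attempts = 0;
    ErrorCode code = ErrorCode::OK;
    while (attempts < maxAttempts) {
        ++attempts;
        code = op();
        if (code == ErrorCode::OK || !isRetriable(code)) {
            break;  // success or a non-retriable error: stop immediately
        }
    }
    return {code, attempts};
}
```

With an operation that always reports ShutdownInProgress (a dropped RSM for a removed shard), this loop uses every attempt it is given. With ReplicaSetMonitorRemoved it stops after the first attempt, which is the fail-fast behavior the ticket asks for.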

Suggested approach:

Introduce a way to distinguish the two cases at the RSM level. For example:

- Add a DropReason enum (e.g., kShutdown, kShardRemoved) to StreamableReplicaSetMonitor::drop().
- When the reason is kShardRemoved, return ReplicaSetMonitorRemoved (199) or ShardNotFound (70), a non-retriable error that signals the shard is gone.
- When the reason is kShutdown, continue returning ShutdownInProgress (91) to preserve the SERVER-51329 behavior for external clients.

This would make internal retry loops correctly fail fast on shard removal while preserving transparent retry semantics for process shutdown.
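A rough sketch of the suggested approach, under stated assumptions: MonitorSketch is a hypothetical, heavily simplified stand-in for StreamableReplicaSetMonitor, errorForGetHostOrRefresh() is an invented name that models only the error path of getHostOrRefresh(), and DropReason is the enum proposed above (it does not exist yet). The choice of ReplicaSetMonitorRemoved over ShardNotFound for the shard-removal case is arbitrary here; the ticket leaves it open.

```cpp
#include <cassert>

// Error codes named in this ticket, with the numeric values quoted above.
enum class ErrorCode {
    OK = 0,
    ShardNotFound = 70,
    ShutdownInProgress = 91,
    ReplicaSetMonitorRemoved = 199
};

// Proposed enum from this ticket; hypothetical until implemented.
enum class DropReason { kShutdown, kShardRemoved };

// Hypothetical simplified monitor: tracks only whether and why it was dropped.
class MonitorSketch {
public:
    // Records the reason so later lookups can report the right error.
    void drop(DropReason reason) {
        _dropped = true;
        _reason = reason;
    }

    // Models only the error path: OK means "not dropped", in which case
    // the real method would go on to select a host.
    ErrorCode errorForGetHostOrRefresh() const {
        if (!_dropped) {
            return ErrorCode::OK;
        }
        if (_reason == DropReason::kShardRemoved) {
            // Non-retriable: the shard is gone, the process is healthy.
            return ErrorCode::ReplicaSetMonitorRemoved;
        }
        // Retriable: preserves the SERVER-51329 behavior on shutdown.
        return ErrorCode::ShutdownInProgress;
    }

private:
    bool _dropped = false;
    DropReason _reason = DropReason::kShutdown;
};
```

Storing the reason on the monitor keeps the decision in one place, so callers holding a reference to a dropped monitor would see the distinction without any change on their side.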

Is related to:

SERVER-51329 Unexpected non-retryable error when shutting down a mongos server (Closed)

SERVER-123567 ShardRegistry should not remove RSM when a live shard still uses the same replica set name (Closed)

Assignee: Unassigned
Reporter: Igor Praznik
Participants: Igor Praznik
Votes: 0
Watchers: 3

Created: Apr 21 2026 01:19:06 PM UTC
Updated: Apr 23 2026 09:47:09 AM UTC
