Differentiate ReplicaSetMonitor removal error codes: shard removal vs. process shutdown

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing
    • 200
    • 🟩 Routing and Topology

      When a ReplicaSetMonitor (RSM) is dropped, StreamableReplicaSetMonitor::getHostOrRefresh() unconditionally returns ShutdownInProgress (error code 91) via makeReplicaSetMonitorRemovedError(). This was introduced by SERVER-51329, which changed the error from the non-retriable ReplicaSetMonitorRemoved (199) to the retriable ShutdownInProgress (91) so that external clients (drivers, mongos) would transparently retry when a mongos is shutting down.

      However, there are two fundamentally different scenarios that trigger RSM removal:

      1. Process shutdown: the mongos/mongod is shutting down, and all RSMs are dropped. In this case, ShutdownInProgress is the correct error: the client should retry against another mongos.
      2. Shard removal: a shard is removed from the cluster (e.g., via removeShard), and the ShardRegistry drops the RSM for the removed replica set. In this case, ShutdownInProgress is misleading: the process is still healthy, but the shard no longer exists.

      Returning ShutdownInProgress for shard removal causes server-internal retry loops (such as withAutomaticRetry in resharding, oplog fetching, etc.) to treat it as a transient error and retry indefinitely against a shard that will never come back. While the immediate trigger for infinite retries in resharding was fixed by SERVER-123567 (which prevents the RSM from being poisoned in the first place), the underlying ambiguity of the error code remains a latent risk for any code path that holds a shared_ptr<Shard> with a dropped RSM.
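
The retry behavior described above can be illustrated with a minimal C++ sketch. It is not the server's withAutomaticRetry: retryWithCap, isRetriableSketch, and RetryResult are hypothetical names. The retriable classification is reduced to the single code discussed in this ticket, and the attempt cap exists only so the sketch terminates.

```cpp
#include <functional>

// Numeric error codes as quoted in this ticket.
constexpr int kOK = 0;
constexpr int kShutdownInProgress = 91;
constexpr int kReplicaSetMonitorRemoved = 199;

// Simplified assumption: only ShutdownInProgress counts as retriable here.
inline bool isRetriableSketch(int code) {
    return code == kShutdownInProgress;
}

struct RetryResult {
    int finalCode;  // code returned by the last attempt
    int attempts;   // number of times the operation was invoked
};

// Hypothetical retry loop. A real internal loop may have no cap at all,
// which is what turns a permanently failing target into an endless retry.
inline RetryResult retryWithCap(const std::function<int()>& op, int maxAttempts) {
    RetryResult result{kOK, 0};
    while (result.attempts < maxAttempts) {
        ++result.attempts;
        result.finalCode = op();
        // Stop on success or on any error not classified as retriable.
        if (result.finalCode == kOK || !isRetriableSketch(result.finalCode)) {
            break;
        }
    }
    return result;
}
```

With an operation that always returns 91 (a removed shard under the current behavior), the loop uses every allowed attempt. With one that returns 199, it stops after the first call.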

      Suggested approach:

      Introduce a way to distinguish the two cases at the RSM level. For example:

      • Add a DropReason enum (e.g., kShutdown, kShardRemoved) to StreamableReplicaSetMonitor::drop().
      • When the reason is kShardRemoved, return ReplicaSetMonitorRemoved (199) or ShardNotFound (70) — a non-retriable error that signals the shard is gone.
      • When the reason is kShutdown, continue returning ShutdownInProgress (91) to preserve the SERVER-51329 behavior for external clients.

      This would make internal retry loops correctly fail fast on shard removal while preserving transparent retry semantics for process shutdown.
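
The suggested approach could look roughly like the following self-contained C++ sketch. FakeReplicaSetMonitor is a stand-in, not the real StreamableReplicaSetMonitor. The DropReason enum and the drop(DropReason) signature are the proposal from this ticket and do not exist today. Returning ReplicaSetMonitorRemoved (199) rather than ShardNotFound (70) for shard removal is an arbitrary choice between the two options listed above.

```cpp
#include <optional>

// Error codes named in this ticket, with the numeric values quoted above.
enum class ErrorCode : int {
    OK = 0,
    ShardNotFound = 70,
    ShutdownInProgress = 91,
    ReplicaSetMonitorRemoved = 199,
};

// Proposed (hypothetical) enum describing why a monitor was dropped.
enum class DropReason { kShutdown, kShardRemoved };

// Stand-in for the real monitor, reduced to the drop state only.
class FakeReplicaSetMonitor {
public:
    // Records the reason so later calls can report a matching error.
    void drop(DropReason reason) {
        _dropReason = reason;
    }

    // Returns OK while the monitor is live, otherwise the removal error.
    ErrorCode getHostOrRefresh() const {
        if (!_dropReason) {
            return ErrorCode::OK;
        }
        return makeRemovedError(*_dropReason);
    }

private:
    // Maps the drop reason to an error code, as proposed in this ticket.
    static ErrorCode makeRemovedError(DropReason reason) {
        switch (reason) {
            case DropReason::kShardRemoved:
                // Non-retriable: the shard is gone for good.
                return ErrorCode::ReplicaSetMonitorRemoved;
            case DropReason::kShutdown:
                // Retriable: preserves the SERVER-51329 behavior.
                return ErrorCode::ShutdownInProgress;
        }
        return ErrorCode::ShutdownInProgress;  // defensive default
    }

    std::optional<DropReason> _dropReason;
};
```

In this sketch the shutdown path would call drop(DropReason::kShutdown) and the ShardRegistry path would call drop(DropReason::kShardRemoved). How the reason would actually be threaded through the real call sites is left open.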

            Assignee:
            Unassigned
            Reporter:
            Igor Praznik
