[SERVER-22107] Improve error message when ReplicaSetMonitor cannot connect to a replSet node in mongos Created: 08/Jan/16  Updated: 06/Dec/22  Resolved: 12/Dec/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.6.11
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Emily Stolfo Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Done Votes: 0
Labels: PM550
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-23192 mongos and shards will become unusabl... Closed
Assigned Teams:
Sharding
Operating System: ALL
Participants:

 Description   

When mongos cannot connect to any member of a shard replica set for an extended period (more than 5 minutes), it removes the in-memory ReplicaSetMonitor for that set. As a consequence, it starts returning an "unknown replica set" error instead of the usual "cannot connect to host X" error.
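
To make the behaviour above concrete, here is a minimal Python sketch of a monitor cache that drops a set after 5 minutes without contact. It is an illustration only, not the mongos implementation: the names MonitorCache, MonitorEntry, UnknownReplicaSet, and FORGET_AFTER_SECS are hypothetical, and only the 5-minute threshold and the error text come from this ticket.

```python
import time
from dataclasses import dataclass

# From the ticket: a set that is unreachable for more than 5 minutes is dropped.
FORGET_AFTER_SECS = 5 * 60


class UnknownReplicaSet(Exception):
    """Hypothetical stand-in for the 'unknown replica set' error."""


@dataclass
class MonitorEntry:
    seeds: list          # seed hosts for the set (hypothetical field)
    last_contact: float  # clock reading at the last successful contact


class MonitorCache:
    """Toy model of an in-memory cache of per-set monitors."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}

    def track(self, set_name, seeds):
        # Start monitoring a set; treat registration as a successful contact.
        self._entries[set_name] = MonitorEntry(list(seeds), self._clock())

    def record_contact(self, set_name):
        # Called whenever any member of the set was reached.
        self._entries[set_name].last_contact = self._clock()

    def evict_stale(self):
        # Drop every set that has had no successful contact for too long.
        now = self._clock()
        stale = [name for name, entry in self._entries.items()
                 if now - entry.last_contact > FORGET_AFTER_SECS]
        for name in stale:
            del self._entries[name]
        return stale

    def get(self, set_name):
        self.evict_stale()
        entry = self._entries.get(set_name)
        if entry is None:
            # The entry is gone, so the lookup can no longer tell
            # "never existed" apart from "unreachable for too long".
            raise UnknownReplicaSet(f"unknown replica set {set_name}")
        return entry
```

In this model, once the entry is evicted the lookup has no record that the set was ever known, which is why the caller sees "unknown replica set" rather than a connection error.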

Original summary:
'unknown replica set' error message from mongos during failover
Original description:

We had a user report that he got an error from mongos during a failover:

unknown replica set d303f52f76534680956e00a707bbca89 (71)

This doesn't seem to be an error that should make its way back to the user. Is this possibly a bug in mongos?

Here is the relevant report from the user.



 Comments   
Comment by Andy Schwerin [ 12/Dec/16 ]

As of 3.2.11 and 3.4.0, mongos no longer forgets about replica sets it has had trouble contacting, so the message should no longer appear.

Comment by Bernie Hackett [ 13/Jan/16 ]

I see. I misunderstood and thought you were saying it returned this error if it couldn't communicate with any individual seed. The problem is when it can't talk to the shard at all. In that case, I think the error message needs some work. The current message makes it sound like the replica set's setName changed or something.

Comment by Randolph Tan [ 13/Jan/16 ]

I believe that is what is happening - mongos can't complete the request because it can't connect to the shard.

Comment by Bernie Hackett [ 13/Jan/16 ]

But why is that reported back to the client? This seems like something that should be logged by mongos. It should only cause a client side error if mongos can't complete the request.

Comment by Randolph Tan [ 13/Jan/16 ]

Mongos can return this error message if it cannot connect to any node in the seed list (for mongos, the seed list is extracted from config.shards, which is updated whenever mongos detects membership changes). I propose that we change the message to something that gives more context.

Note that mongos also 'forgets' cached replica sets if it cannot contact any of their members for 5 minutes. The cache entry can be repopulated when mongos next needs to talk to the replica set and at least one member in the seed list can be reached.

Generated at Thu Feb 08 03:59:24 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.