[SERVER-7249] poor error message and arguably shouldnt even give an error Created: 03/Oct/12  Updated: 10/Dec/14  Resolved: 02/May/14

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor - P4
Reporter: Dwight Merriman Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

In a system with 3 config servers, I shut down one of them. I then did the following, which returned an error. (1) The message isn't very clear: I would really like to know that a config server was unreachable, not a mongod, for example. (2) Arguably I shouldn't get an error at all; the query could be retried against another server. However, be careful: retries could add latency if multiple servers are down, and/or have a multiplicative effect on load. Also, is there a background heartbeat detector? That might help some as another approach.
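To make point (2) concrete, here is a minimal sketch of a bounded retry, assuming a read can be served by any one of the config servers. It is not how mongos is implemented; `readWithFailover`, `tryRead`, and `maxAttempts` are hypothetical names. Each host is tried at most once and the total number of attempts is capped, so one request can never cost more than `maxAttempts` calls.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper, not a MongoDB API.
// tryRead(host, out) returns true and fills `out` when the host answers.
// Hosts are tried in order, each at most once, and the loop stops after
// maxAttempts calls so the extra latency and load stay bounded.
bool readWithFailover(
    const std::vector<std::string>& hosts,
    std::size_t maxAttempts,
    const std::function<bool(const std::string&, std::string&)>& tryRead,
    std::string& out) {
    std::size_t attempts = 0;
    for (const std::string& host : hosts) {
        if (attempts >= maxAttempts) {
            break;  // cap reached: give up instead of walking every host
        }
        ++attempts;
        if (tryRead(host, out)) {
            return true;  // first host that answers wins
        }
    }
    return false;  // caller reports the failure
}
```

With three hosts and `maxAttempts` set to 2, a single down server costs one extra call, and a total outage costs two calls rather than three.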



 Comments   
Comment by Greg Studer [ 02/May/14 ]

Think this issue has become stale - we've clarified config messages a bit as well.

Full fix will be when we move away from SCC (SyncClusterConnection) - it prevents us from being smarter here.

Comment by Dwight Merriman [ 03/Oct/12 ]

oops i had a couple of these and failed to post the example.

one was something like this:

        ss << _ei.code << " socket exception [" << _type << "] ";

another was "all servers down!"

another:

Wed Oct 03 11:59:17 [conn68] warning: db exception when initializing on config:dm_hp:27019,dm_hp:27020,dm_hp:27021, current connection state is { state: { conn: "SyncClusterConnection [dm_hp:27019,dm_hp:27020,dm_hp:27021]", vinfo: "config:dm_hp:27019,dm_hp:27020,dm_hp:27021", cursor: "(empty)", count: 0, done: false }, retryNext: false, init: false, finish: false, errored: false } :: caused by :: 8008 all servers down!

The last one has some detail in the log, but the getlasterror message returned was very vague.

Basically, I would recommend simulating different kinds of failures (mongod, cluster managers) and seeing whether the getlasterror results give you an inkling as to what is up. If config servers are down, we should say that. If shard 17 or 200 is down, we should say that rather than a host name.
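As a rough illustration of the kind of message suggested here, the sketch below puts the role of the unreachable node (config server or a named shard) ahead of the host name. It is not taken from the server source; `NodeRole` and `describeUnreachable` are hypothetical names, and the message wording is only an assumption.

```cpp
#include <sstream>
#include <string>

// Hypothetical role tag; not an actual MongoDB type.
enum class NodeRole { ConfigServer, Shard };

// Build an error string that names the role (and the shard, when known)
// before the host, so the caller can tell what kind of node is down.
std::string describeUnreachable(NodeRole role,
                                const std::string& shardName,
                                const std::string& host,
                                int code) {
    std::ostringstream ss;
    ss << code << " socket exception: ";
    if (role == NodeRole::ConfigServer) {
        ss << "config server";
    } else {
        ss << "shard " << shardName;
    }
    ss << " unreachable at " << host;
    return ss.str();
}
```

For example, `describeUnreachable(NodeRole::Shard, "17", "dm_hp:27018", 8008)` yields `8008 socket exception: shard 17 unreachable at dm_hp:27018`; the port and the pairing of code 8008 with this message are made up for the example.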

Generated at Thu Feb 08 03:14:00 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.