[SERVER-7249] poor error message and arguably shouldnt even give an error Created: 03/Oct/12 Updated: 10/Dec/14 Resolved: 02/May/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor - P4 |
| Reporter: | Dwight Merriman | Assignee: | Unassigned |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Participants: |
| Description |
|
In a system with 3 config servers, I shut down one of them. I then did the following, which returned an error. (1) The message isn't very clear: I would really like to know that a config server was unreachable, not a mongod, for example. (2) Arguably I shouldn't get an error at all; the query could be retried against another server. However, be careful: retries could add latency if multiple servers are down and/or have a multiplicative effect on load. Also, is there a background heartbeat detector? That might help some, as another approach. |
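The retry idea in (2), with the latency and load caveats, can be sketched roughly as follows. This is a minimal illustration, not the server's actual behavior: `read_from_config_servers`, `ConfigServerUnreachable`, the `query` callable, and the `budget_secs` default are all hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, Sequence


class ConfigServerUnreachable(Exception):
    """Hypothetical error: no config server answered within the retry budget."""


def read_from_config_servers(
    hosts: Sequence[str],
    query: Callable[[str], dict],
    budget_secs: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Try each config server at most once, stopping when the budget is spent."""
    deadline = clock() + budget_secs
    failures = []
    for host in hosts:
        # The budget is only checked between attempts; a single slow
        # attempt is assumed to be bounded by the query's own timeout.
        if clock() >= deadline:
            break
        try:
            return query(host)
        except ConnectionError as exc:
            failures.append(f"{host}: {exc}")
    tried = "; ".join(failures) or "none tried"
    # Name the role explicitly so the caller knows a config server failed.
    raise ConfigServerUnreachable(
        f"no config server reachable ({len(failures)} of {len(hosts)} tried): {tried}"
    )
```

Trying each host at most once keeps the extra load per query bounded by the number of config servers, and the time budget limits the added latency when several of them are down.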
| Comments |
| Comment by Greg Studer [ 02/May/14 ] |
|
I think this issue has become stale; we've clarified config messages a bit as well. The full fix will come when we move away from SCC, which prevents us from being smarter here. |
| Comment by Dwight Merriman [ 03/Oct/12 ] |
|
Oops, I had a couple of these and failed to post the example. One was something like this:
Another was "all servers down!" Another:
The last one has some detail in the log, but the getlasterror message returned was very vague. Basically, I would recommend simulating different kinds of failures (mongod, cluster managers) and seeing whether the getlasterror results give you an inkling as to what is up. If config servers are down, we should say that. If shard 17 or 200 is down, we should say that rather than a host name. |
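As a rough illustration of the recommendation to report the role of the failed component rather than only a host name, the sketch below maps an unreachable node to a message. The `Node` type, the role labels, and the message wording are assumptions made for this example and do not reflect the server's real error strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    host: str
    role: str  # hypothetical labels: "config" or "shard"
    shard_name: Optional[str] = None


def describe_unreachable(node: Node) -> str:
    """Build an error message that names the node's role in the cluster."""
    if node.role == "config":
        return f"config server unreachable ({node.host})"
    if node.role == "shard" and node.shard_name:
        return f"shard {node.shard_name} unreachable ({node.host})"
    # Fall back to the bare host name when the role is unknown.
    return f"host unreachable ({node.host})"
```

For example, `describe_unreachable(Node("h17:27018", "shard", "17"))` yields a message that leads with the shard name and keeps the host only as supporting detail.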