[SERVER-6178] Cannot use mongos if subset of config servers can't read from or write to disk
Created: 22/Jun/12 | Updated: 11/Jul/16 | Resolved: 05/Jul/12
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.0.1, 2.0.2, 2.0.3, 2.0.4, 2.0.5, 2.0.6 |
| Fix Version/s: | 2.0.7, 2.2.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matthew Barlocker | Assignee: | Greg Studer |
| Resolution: | Done | Votes: | 0 |
| Labels: | configsrv, mongos |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Any |
| Issue Links: |
| Operating System: | Linux |
| Participants: |
| Description |
| Comments |
| Comment by Greg Studer [ 05/Jul/12 ] |
Created a new ticket to track the changes to config server timeouts once mongos has successfully started, linked above.
| Comment by Matthew Barlocker [ 28/Jun/12 ] |
Yes, as far as I can remember that message looks similar to the one I ran into.
| Comment by auto [ 28/Jun/12 ] |
Author: Greg Studer <greg@10gen.com> (2012-06-28T15:22:10-07:00)
Message: Backport of commit 29253bec3ba365668d503ca015c4e9a7f4cc3f0d. Signed-off-by: Tad Marshall <tad@10gen.com>
| Comment by auto [ 28/Jun/12 ] |
Author: Greg Studer <greg@10gen.com> (2012-06-28T12:32:50-07:00)
Message: blah
| Comment by Greg Studer [ 28/Jun/12 ] |
A secondary issue is that certain connections to the config server don't time out after 30s. This causes problems: it is possible to connect to mongos once it starts up (which takes ~2 min), but certain operations can then hang when trying to reload config data.
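The hang described above can be illustrated with a minimal Python sketch. It is not mongos code: `open_silent_listener` and `read_with_timeout` are hypothetical helpers, and the listener simply never replies, standing in for a config server that accepts connections but cannot read from disk. A read bounded by a socket timeout returns control to the caller; a read without one would block indefinitely.

```python
import socket


def open_silent_listener():
    """Listen on an ephemeral localhost port but never send any data.

    The kernel completes TCP handshakes from the listen backlog, so clients
    see an established connection even though nothing ever answers them.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(5)
    return listener


def read_with_timeout(port, timeout_s):
    """Connect, send a request, and wait at most timeout_s for a reply.

    Returns "data", "closed", or "timeout".
    """
    with socket.create_connection(("127.0.0.1", port), timeout=timeout_s) as conn:
        conn.sendall(b"ping")
        try:
            return "data" if conn.recv(1) else "closed"
        except socket.timeout:
            # The peer accepted the connection but never answered.
            return "timeout"
```

Calling `read_with_timeout(listener.getsockname()[1], 0.2)` against such a listener reports a timeout instead of hanging; the 30s figure in the comment would correspond to `timeout_s=30`.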
| Comment by Greg Studer [ 28/Jun/12 ] |
The issue above seems to be that we catch SocketException rather than DBException when calling checkConfigServersConsistent in mongos main. If NFS is down, this causes a cursor exception (a DBException), which slips through and terminates mongos. Mongos does start up after changing the catch to DBException; however, since the affected config server continues to accept new connections (it just doesn't return data after accepting), the startup process still has to wait out 30s timeouts and is very slow. Avoiding this would require some way of marking the server as "bad" even when it responds successfully to non-disk operations, which is tricky.
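The catch-clause problem can be sketched in Python as an analogy; the actual fix is in the C++ server code and is not reproduced here. The classes and functions below are hypothetical stand-ins that only borrow the names from the comment, and the sketch assumes that SocketException is a subtype of DBException.

```python
class DBException(Exception):
    """Hypothetical stand-in for the general database exception."""


class SocketException(DBException):
    """Hypothetical stand-in for a network-level failure (assumed subtype)."""


def check_config_servers_consistent(fail_with=None):
    """Pretend consistency check that raises the given exception type, if any."""
    if fail_with is not None:
        raise fail_with("config server did not return data")
    return True


def startup_narrow(fail_with=None):
    """Handle only socket errors; any other DBException propagates to the caller."""
    try:
        check_config_servers_consistent(fail_with)
        return "ok"
    except SocketException:
        return "retry"


def startup_broad(fail_with=None):
    """Handle the base type, which also covers socket errors."""
    try:
        check_config_servers_consistent(fail_with)
        return "ok"
    except DBException:
        return "retry"
```

With these definitions, `startup_narrow(DBException)` raises, which corresponds to mongos terminating, while `startup_broad(DBException)` returns `"retry"`. Neither version addresses the slow startup, since each retry still has to wait for the timeout.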
| Comment by Greg Studer [ 28/Jun/12 ] |
Note: the problem is not reproduced if all packets to the server (port 3000x) are simply dropped; it is reproduced only if packets to NFS (port 2049) are dropped.
| Comment by Greg Studer [ 28/Jun/12 ] |
Hmm... I tried to reproduce this on my end (Ubuntu 10.10). While I'm able to hang the config server, I'm not able to reproduce the mongos crash if the hanging config server is second in the mongos list. I am able to reproduce a seemingly similar error if the config server that is down is first in the list:
Is this similar to what you saw?
| Comment by Matthew Barlocker [ 27/Jun/12 ] |
Unfortunately, I shut down the servers that I used to duplicate the issue, and the logs have already rotated out for my production servers. My only way of getting the logs is to reproduce using the steps given above.
| Comment by Greg Studer [ 27/Jun/12 ] |
We'll start trying to reproduce this issue on our side. Thanks for the detailed bug report. There's plenty of information for us to get started, but it would also help if you could post a sample mongos log (especially at high verbosity, -vvvvv), if that's easy for you to do, so we can verify that we're reproducing the same problem.