[SERVER-7507] Random mongos failure to contact whole cluster Created: 30/Oct/12 Updated: 10/Dec/14 Resolved: 28/May/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Networking, Replication, Sharding |
| Affects Version/s: | 2.2.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | noizwaves | Assignee: | Randolph Tan |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | nh-240 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
AWS, Ubuntu 12.04.1 LTS; 2x app servers (each running mongos) |
||
| Attachments: |
|
| Operating System: | Linux |
| Participants: |
| Description |
|
Hi, During routine operation of our mongo cluster, the mongos process on one of our app servers became unresponsive (confirmed by ssh'ing to the app server, running mongo, and running 'show dbs'). Attached is the mongos.log file from when the issue started, until after mongos was manually restarted and recovered. The machine maintained full network connectivity during this time, and DNS names were resolving in shell. During this time, the other app server and background worker show clean mongos.logs (just acquiring and unlocking the distributed lock). How can we prevent this happening in future? This kind of failure is critical for us, and I'm happy to help debug/diagnose it further. |
| Comments |
| Comment by Barrie Segal [ 11/Apr/13 ] |
|
Adam, Are you still seeing this issue? Have you been able to try upgrading to 2.2.4? Barrie |
| Comment by Randolph Tan [ 12/Feb/13 ] |
|
Hi, Would you mind elaborating on what kind of failure you are seeing? Are you referring to the socket exceptions in the mongos logs? |
| Comment by noizwaves [ 15/Jan/13 ] |
|
Hi, have there been any developments with this? I hate to nag, but this is causing sporadic and random critical errors in our system, affecting our uptime. We are happy to help debug this in any way we can. |
| Comment by noizwaves [ 19/Dec/12 ] |
|
Hey, we are consistently seeing these errors every day now. Is there anything more we can do to escalate this issue? Happy to debug anything from our end. Cheers, Adam |
| Comment by noizwaves [ 13/Dec/12 ] |
|
Thanks for the tips, Eliot. We've updated to 2.2.1 and this did not resolve the issue. We've been encountering it more frequently lately, so I'll try to capture a dump. (We've also bumped logging up to vvvvv for the moment.) |
| Comment by Eliot Horowitz (Inactive) [ 31/Oct/12 ] |
|
A little hard to diagnose with this info.
|
| Comment by noizwaves [ 30/Oct/12 ] |
|
Hi, the issue has happened again to the same machine. This time, mongos was able to come back online. Any guidance on diagnosing this issue would be appreciated. Thanks, Adam |
| Comment by noizwaves [ 30/Oct/12 ] |
|
mongos log file from second issue occurrence |