[SERVER-7507] Random mongos failure to contact whole cluster
Created: 30/Oct/12  Updated: 10/Dec/14  Resolved: 28/May/13

Status: Closed
Project: Core Server
Component/s: Networking, Replication, Sharding
Affects Version/s: 2.2.0
Fix Version/s: None

Type: Bug
Priority: Critical - P2
Reporter: noizwaves
Assignee: Randolph Tan
Resolution: Incomplete
Votes: 0
Labels: nh-240
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

AWS, Ubuntu 12.04.1 LTS
2x shards (each shard consists of 2x replicas and 1x arbiter)

2x app servers (each running mongos)
1x background worker (running mongos)


Attachments: mongo_send_error.tar.gz (File), mongos-2.log (Text File), mongos.log (Text File)
Operating System: Linux
Participants:

 Description   

Hi,

During routine operation of our mongo cluster, the mongos process on one of our app servers became unresponsive. We confirmed this by SSHing into the app server, starting the mongo shell, and running 'show dbs'.

Attached is the mongos.log file covering the period from when the issue started until after mongos was manually restarted and recovered. The machine maintained full network connectivity during this time, and DNS names were resolving in the shell.

During this time, the other app server and background worker show clean mongos.logs (just acquiring and unlocking the distributed lock).

How can we prevent this happening in future? This kind of failure is critical for us, and I'm happy to help debug/diagnose it further.



 Comments   
Comment by Barrie Segal [ 11/Apr/13 ]

Adam,

Are you still seeing this issue? Have you been able to try upgrading to 2.2.4?

Barrie

Comment by Randolph Tan [ 12/Feb/13 ]

Hi,

Would you mind elaborating on what kind of failure you are seeing? Are you referring to the socket exceptions in the mongos logs?

Comment by noizwaves [ 15/Jan/13 ]

Hi, have there been any developments with this? I hate to nag, but this is causing sporadic, random critical errors in our system that affect our uptime.

We are happy to help debug this in any way we can.

Comment by noizwaves [ 19/Dec/12 ]

Hey, we are consistently seeing these errors every day now. Is there anything more we can do to escalate this issue? Happy to debug anything from our end.

Cheers, Adam

Comment by noizwaves [ 13/Dec/12 ]

Thanks for the tips, Eliot. We've updated to 2.2.1 and this did not resolve the issue. We've been encountering it more frequently lately, so I'll try to capture a dump. (We've also bumped logging up to vvvvv for the moment.)

Comment by Eliot Horowitz (Inactive) [ 31/Oct/12 ]

A little hard to diagnose with this info.
A few things:

  • can you upgrade to 2.2.1? Various fixes in it could account for this, though I'm not 100% sure
  • if it happens again, can you attach with gdb and get a dump so we can look through it?
  • also if it happens again, can you run top and tell us whether the CPU is spiked or idle?
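
The diagnostics requested above could be captured with something like the following Python sketch. It is not from this ticket: the function names and output file names are hypothetical. It assumes that `gdb` and the procps `top` are installed, that the caller has permission to attach to the mongos process, and that a backtrace of every thread is an acceptable form of "dump" (the comment does not say which kind is wanted). Attaching gdb pauses the process while the backtraces are collected.

```python
import shutil
import subprocess
from typing import List


def top_snapshot_command(pid: int) -> List[str]:
    """One non-interactive (batch) iteration of top, limited to one PID."""
    return ["top", "-b", "-n", "1", "-p", str(pid)]


def gdb_backtrace_command(pid: int) -> List[str]:
    """Attach to the PID, print a backtrace for every thread, then exit."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def capture(cmd: List[str], out_path: str) -> int:
    """Run cmd, save combined stdout/stderr to out_path, return the exit code."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    with open(out_path, "w") as out:
        return subprocess.run(cmd, stdout=out, stderr=subprocess.STDOUT).returncode


def capture_mongos_state(pid: int) -> None:
    """Hypothetical helper: record CPU usage first, then thread backtraces."""
    capture(top_snapshot_command(pid), f"mongos-{pid}-top.txt")
    capture(gdb_backtrace_command(pid), f"mongos-{pid}-gdb.txt")
```

For example, `capture_mongos_state(12345)` would write `mongos-12345-top.txt` and `mongos-12345-gdb.txt` to the current directory, where 12345 is a placeholder PID. Running `top` first keeps the CPU reading from being skewed by the gdb attach.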

Comment by noizwaves [ 30/Oct/12 ]

Hi, the issue has happened again on the same machine. This time, mongos was able to come back online. Any guidance on diagnosing this issue would be appreciated.

Thanks,

Adam

Comment by noizwaves [ 30/Oct/12 ]

mongos log file from the second occurrence of the issue
