We're running a couple of MongoDB clusters in AWS EC2 and the other day one of our clusters started to misbehave.
It turned out that the PRIMARY node was unable to contact the other nodes in the cluster. In its logs we could see the following lines repeated (for brevity I've only included one host in the lines, but it showed the same for the third node as well):
Sat Dec 14 01:55:54 [rsHealthPoll] getaddrinfo("host2") failed: Temporary failure in name resolution
Sat Dec 14 01:55:54 [rsHealthPoll] couldn't connect to host2:27017: couldn't connect to server host2:27017
On the other nodes we observed the following in their logs:
Sat Dec 14 01:56:24 [rsSyncNotifier] replset tracking exception: exception: 10278 dbclient error communicating with server: host1:27017
Sat Dec 14 01:56:24 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: host1:27017
Sat Dec 14 01:56:24 [rsHealthPoll] replSet member host1:27017 is now in state SECONDARY
Sat Dec 14 01:56:24 [rsMgr] not electing self, host1:27017 would veto
The first thing we tried was manually adding the other nodes' (host2 and host3) IP addresses to host1's /etc/hosts file, and within a minute the issue seemed resolved (i.e. host1 could talk to the other nodes again), but that was hardly a permanent solution.
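For reference, the workaround amounted to something like the following on host1 (the 10.0.0.x addresses are placeholders for illustration, not our actual cluster IPs):

```shell
# Pin the replica set members' names to their private IPs so
# mongod's getaddrinfo() lookups no longer depend on DNS.
cat >> /etc/hosts <<'EOF'
10.0.0.2  host2
10.0.0.3  host3
EOF
```

Since this bypasses DNS entirely, it only masks the resolution problem rather than fixing it.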
We continued by running rs.stepDown() in the mongo console, removed the manually added lines from /etc/hosts and then restarted mongod on host1. When it came back up it was once again able to connect to the other nodes.
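Concretely, the recovery looked roughly like this on host1 (the sed pattern and the mongod init script name are assumptions based on a typical install; adjust for your setup):

```shell
# 1. Ask the PRIMARY to step down so an election can pick a healthy node.
mongo --eval 'rs.stepDown()'

# 2. Remove the manually added host2/host3 entries again
#    (assumes no other /etc/hosts lines mention these names).
sed -i '/host2\|host3/d' /etc/hosts

# 3. Restart mongod so it starts with a clean resolver state
#    (service name is an assumption; yours may differ).
service mongod restart
```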
It should be noted that while host1 was unable to contact either host2 or host3, we were able to use dig to verify that our internal DNS could resolve their names, and we could SSH from one host to the other. Our other services, which use the same DNS server, showed no similar problems resolving names, and our alarm check for DNS resolution didn't trigger either.
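The dig checks we ran from host1 were along these lines (the 10.0.0.53 resolver address is a placeholder for our internal DNS server):

```shell
# Resolve host2 via the system's configured resolver.
dig +short host2

# Query the internal DNS server directly, bypassing the local
# resolver configuration, to confirm the server itself answers.
dig +short host2 @10.0.0.53
```

Both returned the expected address, which is why we suspect the failure was local to the long-running mongod process (e.g. cached resolver state) rather than the DNS server itself.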
Another thing of interest is that host1 has been running mongod since the 27th of November 2012, while the other hosts (host2 and host3) were recently restarted (23rd of November 2013) when we had a similar issue (though that time we ended up with two PRIMARY nodes).
Let me know if there is anything else you need regarding this (log files, etc.) and I'll see what I can do.