We're running a couple of MongoDB clusters in AWS EC2 and the other day one of our clusters started to misbehave.
It turned out that the PRIMARY node was unable to contact the other nodes in the cluster. In its logs we could see the following lines repeated (for brevity I've only included one host in the lines, but it showed the same for the third node as well):
Sat Dec 14 01:55:54 [rsHealthPoll] getaddrinfo("host2") failed: Temporary failure in name resolution
Sat Dec 14 01:55:54 [rsHealthPoll] couldn't connect to host2:27017: couldn't connect to server host2:27017
On the other nodes we observed the following in their logs:
Sat Dec 14 01:56:24 [rsSyncNotifier] replset tracking exception: exception: 10278 dbclient error communicating with server: host1:27017
Sat Dec 14 01:56:24 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: host1:27017
Sat Dec 14 01:56:24 [rsHealthPoll] replSet member host1:27017 is now in state SECONDARY
Sat Dec 14 01:56:24 [rsMgr] not electing self, host1:27017 would veto
The first thing we tried was manually adding the other nodes' (host2 and host3) IP addresses to host1's /etc/hosts file, and within a minute the issue seemed resolved (i.e. host1 could talk to the other nodes again), but that was hardly a permanent solution.
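For reference, the workaround amounted to something like the following on host1 (the 10.0.0.x addresses are placeholders for illustration, not our actual cluster IPs):

```shell
# Pin the replica set members' names to their private IPs so
# mongod's getaddrinfo() lookups no longer depend on DNS.
cat >> /etc/hosts <<'EOF'
10.0.0.2  host2
10.0.0.3  host3
EOF
```

Since this bypasses DNS entirely, it only masks the resolution problem rather than fixing it.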
We continued by running rs.stepDown() in the mongo console, removed the manually added lines from /etc/hosts and then restarted mongod on host1. When it came back up it was once again able to connect to the other nodes.
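Concretely, the recovery looked roughly like this on host1 (the sed pattern and the mongod init script name are assumptions based on a typical install; adjust for your setup):

```shell
# 1. Ask the PRIMARY to step down so an election can pick a healthy node.
mongo --eval 'rs.stepDown()'

# 2. Remove the manually added host2/host3 entries again
#    (assumes no other /etc/hosts lines mention these names).
sed -i '/host2\|host3/d' /etc/hosts

# 3. Restart mongod so it starts with a clean resolver state
#    (service name is an assumption; yours may differ).
service mongod restart
```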
It should be noted that while host1 was unable to contact either host2 or host3, we were able to use dig to verify that our internal DNS could resolve their names, and we could SSH from one host to the other. Our other services, which use the same DNS server, showed no similar problems resolving names, and our alarm check for DNS resolution didn't trigger either.
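The dig checks we ran from host1 were along these lines (the 10.0.0.53 resolver address is a placeholder for our internal DNS server):

```shell
# Resolve host2 via the system's configured resolver.
dig +short host2

# Query the internal DNS server directly, bypassing the local
# resolver configuration, to confirm the server itself answers.
dig +short host2 @10.0.0.53
```

Both returned the expected address, which is why we suspect the failure was local to the long-running mongod process (e.g. cached resolver state) rather than the DNS server itself.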
Another thing of interest is that host1 has been running mongod since the 27th of November 2012, while the other hosts (host2 and host3) were recently restarted (23rd of November 2013) when we had a similar issue (though that time we ended up with two PRIMARY nodes).
Let me know if there is anything else you need regarding this (log files, etc.) and I'll see what I can do.