[SERVER-12099] getaddrinfo(hostname:port) - Temporary failure in name resolution Created: 15/Dec/13  Updated: 10/Dec/14  Resolved: 05/Mar/14

Status: Closed
Project: Core Server
Component/s: Replication, Stability
Affects Version/s: 2.2.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Mathias Söderberg Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: cluster, dns, replication, replset
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux 2.6.35.14-103.47.amzn1.x86_64, Amazon EC2


Issue Links:
Duplicate
duplicates DOCS-5700 Call res_init() after a failure of ge... Closed
Operating System: Linux
Participants:

 Description   

We're running a couple of MongoDB clusters in AWS EC2 and the other day one of our clusters started to misbehave.

It turned out that the PRIMARY node was unable to contact the other nodes in the cluster. In its logs we could see the following lines repeated (for brevity I've only included one host in the lines, but it showed the same for the third node as well):

Sat Dec 14 01:55:54 [rsHealthPoll] getaddrinfo("host2") failed: Temporary failure in name resolution
Sat Dec 14 01:55:54 [rsHealthPoll] couldn't connect to host2:27017: couldn't connect to server host2:27017

On the other nodes we observed the following in their logs:

Sat Dec 14 01:56:24 [rsSyncNotifier] replset tracking exception: exception: 10278 dbclient error communicating with server: host1:27017
Sat Dec 14 01:56:24 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: host1:27017
Sat Dec 14 01:56:24 [rsHealthPoll] replSet member host1:27017 is now in state SECONDARY
Sat Dec 14 01:56:24 [rsMgr] not electing self, host1:27017 would veto

The first thing we tried was manually adding the other nodes' (host2 and host3) IP addresses to host1's /etc/hosts file, and within a minute the issue seemed resolved (i.e. it could talk to the other nodes again), but that was hardly a permanent solution.

We continued by running rs.stepDown() through the mongo console, removing the manually added lines from /etc/hosts and then restarting mongod on host1. Once it came back up it was again able to connect to the other nodes.

It should be noted that while host1 was unable to contact either host2 or host3, we were able to use dig to verify that our internal DNS could resolve the hostnames, and we could SSH from one host to the other. Our other services, which use the same DNS server, showed no similar behaviour (i.e. no problems with DNS resolution), and we also have an alarm check in place for DNS resolution, which didn't trigger.

Another thing of interest is that host1 has been running mongod since the 27th of November 2012, while the other hosts (host2 and host3) were recently restarted (on the 23rd of November 2013) when we had a similar issue (though that time we ended up with two PRIMARY nodes).

Let me know if there is anything else that you need regarding this (e.g. log files, etc.) and I'll see what I can do.



 Comments   
Comment by Eliot Horowitz (Inactive) [ 15/Dec/13 ]

Another cause of this is running out of file descriptors.
Can you also check that possibility?
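
In case it helps, here is a minimal sketch (assumed Linux; not MongoDB source code) of reading that limit from inside a process; operationally the same information is available via ulimit -n or /proc/<pid>/limits:

    /* Print the process's open file descriptor limits (RLIMIT_NOFILE).
     * A low soft limit can also surface as failed connections/lookups. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void) {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }
        printf("open file limit: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
        return 0;
    }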

Comment by Mathias Söderberg [ 15/Dec/13 ]

Digging deeper, we seem to have found a possible explanation:

http://stackoverflow.com/questions/125466/using-glibc-why-does-my-gethostbyname-fail-after-i-dhcp-has-changed-the-dns-ser

We set up a test where we migrated from one DNS server to another while mongo was running, and were able to reproduce the exact same behaviour.

As the above link indicates, it turns out glibc caches the content of /etc/resolv.conf, which as far as I can tell means mongod has to be restarted (on Linux) in order to pick up changes to any DNS server configuration.

For a long-running server process this seems like an unnecessary limitation, and for glibc the problem can apparently be avoided by explicitly calling res_init() (possibly after repeated name resolution failures).
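
As a rough illustration of the approach the duplicate ticket (DOCS-5700) points at, here is a minimal sketch (assumed Linux/glibc; not the actual MongoDB patch) that retries a failed lookup after calling res_init(), which makes glibc re-read /etc/resolv.conf so a changed DNS server is picked up without restarting the process. "host2" and port 27017 are just the values from the report above:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/nameser.h>
    #include <resolv.h>
    #include <netdb.h>
    #include <stdio.h>
    #include <string.h>

    /* Resolve host:port, and on "Temporary failure in name resolution"
     * (EAI_AGAIN) reload the resolver configuration and retry once. */
    static int resolve_with_retry(const char *host, const char *port,
                                  struct addrinfo **out) {
        struct addrinfo hints;
        memset(&hints, 0, sizeof(hints));
        hints.ai_socktype = SOCK_STREAM;

        int rc = getaddrinfo(host, port, &hints, out);
        if (rc == EAI_AGAIN) {
            res_init();   /* re-read /etc/resolv.conf */
            rc = getaddrinfo(host, port, &hints, out);
        }
        return rc;
    }

    int main(void) {
        struct addrinfo *res = NULL;
        int rc = resolve_with_retry("host2", "27017", &res);
        if (rc != 0) {
            fprintf(stderr, "getaddrinfo(\"host2\") failed: %s\n",
                    gai_strerror(rc));
            return 1;
        }
        freeaddrinfo(res);
        return 0;
    }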
