[SERVER-12099] getaddrinfo(hostname:port) - Temporary failure in name resolution Created: 15/Dec/13 Updated: 10/Dec/14 Resolved: 05/Mar/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Stability |
| Affects Version/s: | 2.2.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Mathias Söderberg | Assignee: | Unassigned |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | cluster, dns, replication, replset | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Linux 2.6.35.14-103.47.amzn1.x86_64, Amazon EC2 |
| Issue Links: |
|
| Operating System: | Linux |
| Participants: | |
| Description |
|
We're running a couple of MongoDB clusters in AWS EC2, and the other day one of our clusters started to misbehave. It turned out that the PRIMARY node was unable to contact the other nodes in the cluster. In its logs we could see the following line repeated (for brevity I've only included one host, but the same appeared for the third node as well):

    Sat Dec 14 01:55:54 [rsHealthPoll] getaddrinfo("host2") failed: Temporary failure in name resolution

On the other nodes we observed the following in their logs:

    Sat Dec 14 01:56:24 [rsSyncNotifier] replset tracking exception: exception: 10278 dbclient error communicating with server: host1:27017

The first thing we tried was manually adding the other nodes' (host2 and host3) IP addresses to host1's /etc/hosts file, and within a minute the issue seemed resolved (i.e. host1 could talk to the other nodes again), but that was hardly a permanent solution. We continued by running rs.stepDown() through the mongo console, removed the manually added lines from /etc/hosts and then restarted mongod on host1. When it came back up it was once again able to connect to the other nodes.

It should be noted that while host1 was unable to contact either host2 or host3, we were able to use dig to verify that our internal DNS could resolve the host names, and we could SSH from one host to the other. Our other services, which use the same DNS server, showed no similar problems with DNS resolution, and we also have an alarm check in place for DNS resolution, which didn't trigger.

Another thing of interest is that host1 had been running mongod since the 27th of November 2012, while the other hosts (host2 and host3) were recently restarted (23rd of November 2013) when we had a similar issue (though that time we ended up with two PRIMARY nodes).

Let me know if there is anything else you need regarding this (e.g. log files) and I'll see what I can do. |
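For context, the call that fails here is getaddrinfo(), which goes through glibc/NSS (i.e. /etc/hosts first, then the configured DNS servers), which is consistent with adding entries to /etc/hosts working around the failure even though dig, which queries the DNS server directly, kept succeeding. Below is a minimal, self-contained sketch of that resolution path; it is not MongoDB code, and the host name "host2" and port "27017" are only illustrative.

    /* Minimal sketch (not MongoDB code): resolve a host the same way the
     * server does, via getaddrinfo(), which consults /etc/hosts and then
     * the DNS servers listed in /etc/resolv.conf. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netdb.h>

    int main(void)
    {
        struct addrinfo hints, *res = NULL;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;

        int rc = getaddrinfo("host2", "27017", &hints, &res);
        if (rc != 0) {
            /* EAI_AGAIN prints as "Temporary failure in name resolution". */
            fprintf(stderr, "getaddrinfo(\"host2\") failed: %s\n", gai_strerror(rc));
            return 1;
        }
        freeaddrinfo(res);
        puts("resolved ok");
        return 0;
    }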
| Comments |
| Comment by Eliot Horowitz (Inactive) [ 15/Dec/13 ] |
|
Another cause of this is running out of file descriptors. |
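If descriptor exhaustion is suspected, one way to check is to compare the process's open descriptors against its RLIMIT_NOFILE soft limit (getaddrinfo() needs to open sockets and files, so it can fail once none are left). The following is only an illustrative, Linux-specific sketch, not MongoDB code; in practice the same information is available from ulimit and /proc/<pid>/fd.

    /* Sketch: report open file descriptors vs. the soft RLIMIT_NOFILE limit. */
    #include <stdio.h>
    #include <dirent.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }

        /* Count entries in /proc/self/fd (includes ".", "..", and the
         * descriptor opendir() itself uses, so the count is approximate). */
        long open_fds = 0;
        DIR *d = opendir("/proc/self/fd");
        if (d != NULL) {
            struct dirent *e;
            while ((e = readdir(d)) != NULL)
                open_fds++;
            closedir(d);
            open_fds -= 2; /* ignore "." and ".." */
        }

        printf("open fds: %ld, soft limit: %llu\n",
               open_fds, (unsigned long long)rl.rlim_cur);
        return 0;
    }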
| Comment by Mathias Söderberg [ 15/Dec/13 ] |
|
Digging deeper, we seem to have found a possible explanation: we set up a test where we migrated from one DNS server to another while mongod was running, and were able to reproduce the exact same behaviour.

As the above link indicates, it turns out glibc caches the contents of /etc/resolv.conf, which as far as I can tell means mongod has to be restarted (on Linux) in order to pick up changes to any DNS server configuration. For a long-running server process this seems like an unnecessary limitation; in glibc the problem can apparently be avoided by explicitly calling res_init (possibly after repeated name resolution failures). |
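As a rough illustration of that workaround, the sketch below retries a failed lookup after calling res_init(), which forces glibc to re-read /etc/resolv.conf (glibc 2.26+ detects resolv.conf changes on its own, but the versions current at the time did not). This is only a sketch under those assumptions, not how mongod is implemented; the host name and port are illustrative, and older systems may need to link with -lresolv.

    /* Sketch: retry name resolution after re-reading /etc/resolv.conf
     * via res_init(). Assumes glibc on Linux. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/nameser.h>
    #include <netdb.h>
    #include <resolv.h>

    static int resolve(const char *host, const char *port)
    {
        struct addrinfo hints, *res = NULL;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;

        int rc = getaddrinfo(host, port, &hints, &res);
        if (rc == EAI_AGAIN) {
            /* "Temporary failure in name resolution": the cached resolver
             * configuration may be stale, so force glibc to re-read
             * /etc/resolv.conf and try once more. */
            res_init();
            rc = getaddrinfo(host, port, &hints, &res);
        }
        if (rc != 0) {
            fprintf(stderr, "getaddrinfo(\"%s\") failed: %s\n", host, gai_strerror(rc));
            return -1;
        }
        freeaddrinfo(res);
        return 0;
    }

    int main(void)
    {
        return resolve("host2", "27017") == 0 ? 0 : 1;
    }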