[SERVER-4567] mongos process doesn't update on DNS change Created: 28/Dec/11 Updated: 30/Mar/12 Resolved: 29/Dec/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 1.8.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Site Operations | Assignee: | Richard Kreuter (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | dns, mongos, sharding | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Participants: |
| Description |
|
After changing the DNS entry and IP for a shard (by removing it from /etc/hosts), the mongos process continued to look for the old IP. DNS should be re-resolved at the "trying reconnect" step. Partial logs are attached; restarting the mongos process fixed the problem.
|
| Comments |
| Comment by Eliot Horowitz (Inactive) [ 29/Dec/11 ] | ||||
|
ping does something different for resolution. | ||||
| Comment by Site Operations [ 29/Dec/11 ] | ||||
|
Not running pdnsd, but one correction: the system that was fubar'd WAS using mdns. It was set to resolve as "files mdns dns". The majority of our systems are set up without mdns (including all of the shards and configs); only the app servers where the mongos was running are using it. I assume that means this is not a bug? I'm still curious why ping worked fine but Mongo didn't... | ||||
| Comment by Scott Hernandez (Inactive) [ 29/Dec/11 ] | ||||
|
Are you running pdnsd? | ||||
| Comment by Site Operations [ 28/Dec/11 ] | ||||
|
Not using MDNS / Avahi, have a primary and secondary DNS set up with static IPs for all servers. /etc/nsswitch.conf shows
/etc/resolv.conf looks similar to this:
| ||||
| Comment by Eliot Horowitz (Inactive) [ 28/Dec/11 ] | ||||
|
Are you using mdns? | ||||
| Comment by Site Operations [ 28/Dec/11 ] | ||||
|
It has to be "cached" somehow, since with no other changes restarting the mongos fixed the problem. No reboot was required. It also shouldn't be a TTL problem, since the entry was originally in /etc/hosts, which carries no explicit TTL, and because this problem was occurring for several hours before it was fixed. I was able to ping the config servers and the shard servers from the command line while the errors were still coming through, which is why this was quite difficult to diagnose and resolve. From the code, your assumption looks valid. It looks like the only place it may be caching is in the actual connections in the WriteBackListener, when it does init() against each host. Is there anything obvious in the section with "WriteBackListener exception" that might fail and cause a reconnect with the same connection, but without doing a resolve? | ||||
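The behaviour requested here (resolve again at every reconnect instead of reusing an address captured when the connection was first set up) can be sketched as follows. This is a minimal illustration in Python, not mongos code: `ReResolvingConnection`, `system_resolve`, and `tcp_connect` are hypothetical names, and the injectable `resolver` and `connector` parameters exist only so the sketch can be exercised without a network.

```python
import socket


def system_resolve(host, port):
    """Ask the system resolver for the current addresses of host."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Keep (ip, port) only; IPv6 sockaddr tuples carry two extra fields.
    return [info[4][:2] for info in infos]


def tcp_connect(addr, timeout=5.0):
    """Open a TCP connection to an already-resolved (ip, port) pair."""
    return socket.create_connection(addr, timeout=timeout)


class ReResolvingConnection:
    """Hypothetical wrapper that stores the host NAME, never a cached IP."""

    def __init__(self, host, port, resolver=system_resolve, connector=tcp_connect):
        self.host = host
        self.port = port
        self._resolver = resolver
        self._connector = connector
        self.conn = None
        self.addr = None

    def reconnect(self):
        """Drop the old connection, resolve the name again, then connect."""
        if self.conn is not None:
            self.conn.close()
            self.conn = None
        last_error = None
        # The lookup happens on every call, so a changed DNS or /etc/hosts
        # entry is picked up without restarting the process.
        for addr in self._resolver(self.host, self.port):
            try:
                self.conn = self._connector(addr)
            except OSError as exc:
                last_error = exc
                continue
            self.addr = addr
            return addr
        raise ConnectionError(
            "could not reach %s:%d" % (self.host, self.port)
        ) from last_error
```

The design point is only that the lookup sits inside `reconnect()`; a wrapper that resolved once in `__init__` would keep dialing the old IP, which matches the symptom described above.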
| Comment by Eliot Horowitz (Inactive) [ 28/Dec/11 ] | ||||
|
Can you check resolution order and mdns? mongo doesn't cache any DNS entries; it just uses the system for all resolution. |
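One way to run the check requested above is to read the `hosts:` line of /etc/nsswitch.conf and then ask the system resolver directly. The sketch below is a hedged illustration in Python, not part of MongoDB: `hosts_lookup_order` and `resolve_now` are hypothetical helper names, and port 27017 is only an assumed default.

```python
import re
import socket


def hosts_lookup_order(nsswitch_text):
    """Return the services on the `hosts:` line, in lookup order."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line.startswith("hosts:"):
            continue
        spec = line[len("hosts:"):]
        # Remove action items such as [NOTFOUND=return]; keep service names.
        spec = re.sub(r"\[[^\]]*\]", " ", spec)
        return spec.split()
    return []


def resolve_now(host, port=27017):
    """Resolve host through the system resolver (getaddrinfo), uncached here."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})
```

For example, `hosts_lookup_order(open("/etc/nsswitch.conf").read())` on the app server would show whether `mdns` sits ahead of `dns`, and calling `resolve_now` with the shard's hostname shows what a getaddrinfo-based client currently receives for that name.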