[SERVER-4567] mongos process doesn't update on DNS change Created: 28/Dec/11  Updated: 30/Mar/12  Resolved: 29/Dec/11

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 1.8.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Site Operations Assignee: Richard Kreuter (Inactive)
Resolution: Done Votes: 0
Labels: dns, mongos, sharding
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

After changing the DNS entry and IP for a shard (by removing it from /etc/hosts), the mongos process continued to look for the old IP. DNS should be re-resolved at the "trying reconnect" step.

Partial logs are attached; restarting the mongos process fixed the problem. A sketch of the suggested re-resolve-on-reconnect behaviour follows the log excerpt below.

Tue Dec 27 19:35:54 [WriteBackListener] WriteBackListener exception : socket exception
Tue Dec 27 19:35:54 [Balancer] reconnect mongoshard05:27017 failed couldn't connect to server mongoshard05:27017
Tue Dec 27 19:35:54 [Balancer] ReplicaSetMonitor::_checkConnection: caught exception mongoshard05:27017 socket exception
Tue Dec 27 19:35:54 [Balancer] trying reconnect to mongoshard06:27017
Tue Dec 27 19:35:57 [WriteBackListener] WriteBackListener exception : socket exception
Tue Dec 27 19:35:59 [Balancer] reconnect mongoshard06:27017 failed couldn't connect to server mongoshard06:27017
Tue Dec 27 19:35:59 [Balancer] ReplicaSetMonitor::_checkConnection: caught exception mongoshard06:27017 socket exception
Tue Dec 27 19:35:59 [ReplicaSetMonitorWatcher] trying reconnect to mongoshard04:27017
Tue Dec 27 19:36:00 [Balancer] ~ScopedDbConnection: _conn != null
Tue Dec 27 19:36:00 [Balancer] caught exception while doing balance: socket exception
Tue Dec 27 19:36:04 [ReplicaSetMonitorWatcher] reconnect mongoshard04:27017 failed couldn't connect to server mongoshard04:27017
Tue Dec 27 19:36:04 [ReplicaSetMonitorWatcher] trying reconnect to mongoshard05:27017
Tue Dec 27 19:36:06 [WriteBackListener] WriteBackListener exception : socket exception
Tue Dec 27 19:36:09 [ReplicaSetMonitorWatcher] reconnect mongoshard05:27017 failed couldn't connect to server mongoshard05:27017
Tue Dec 27 19:36:09 [ReplicaSetMonitorWatcher] trying reconnect to mongoshard06:27017
Tue Dec 27 19:36:11 [WriteBackListener] WriteBackListener exception : socket exception
Tue Dec 27 19:36:12 [WriteBackListener] WriteBackListener exception : socket exception
Tue Dec 27 19:36:14 [ReplicaSetMonitorWatcher] reconnect mongoshard06:27017 failed couldn't connect to server mongoshard06:27017
Tue Dec 27 19:36:15 [ReplicaSetMonitorWatcher] trying reconnect to mongoshard04:27017
Tue Dec 27 19:36:15 [WriteBackListener] WriteBackListener exception : socket exception
Tue Dec 27 19:36:16 [WriteBackListener] WriteBackListener exception : socket exception
Tue Dec 27 19:36:16 [WriteBackListener] WriteBackListener exception : socket exception
Tue Dec 27 19:36:19 [WriteBackListener] WriteBackListener exception : socket exception
Tue Dec 27 19:36:20 [ReplicaSetMonitorWatcher] reconnect mongoshard04:27017 failed couldn't connect to server mongoshard04:27017
Tue Dec 27 19:36:20 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: caught exception mongoshard04:27017 socket exception
Tue Dec 27 19:36:20 [ReplicaSetMonitorWatcher] trying reconnect to mongoshard05:27017
Tue Dec 27 19:36:22 [WriteBackListener] WriteBackListener exception : socket exception
Tue Dec 27 19:36:23 [WriteBackListener] WriteBackListener exception : socket exception
Tue Dec 27 19:36:25 [ReplicaSetMonitorWatcher] reconnect mongoshard05:27017 failed couldn't connect to server mongoshard05:27017
Tue Dec 27 19:36:25 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: caught exception mongoshard05:27017 socket exception
Tue Dec 27 19:36:25 [ReplicaSetMonitorWatcher] trying reconnect to mongoshard06:27017
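
For illustration only, a minimal sketch of the behaviour suggested above: re-run name resolution through the system resolver on every reconnect attempt instead of reusing a previously resolved address. The function names below are hypothetical and are not taken from the mongos source.

// Illustrative sketch: resolve the hostname again on every reconnect attempt.
#include <netdb.h>
#include <sys/socket.h>
#include <unistd.h>
#include <string>

// Resolve `host` through the system resolver (getaddrinfo honours
// /etc/nsswitch.conf and /etc/hosts), then try each returned address.
// Returns a connected fd, or -1 on failure.
int connectFresh(const std::string& host, const std::string& port) {
    addrinfo hints{};
    hints.ai_family = AF_UNSPEC;       // IPv4 or IPv6
    hints.ai_socktype = SOCK_STREAM;

    addrinfo* res = nullptr;
    if (getaddrinfo(host.c_str(), port.c_str(), &hints, &res) != 0)
        return -1;

    int fd = -1;
    for (addrinfo* ai = res; ai != nullptr; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0) continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) break;   // connected
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}

// Calling connectFresh() from the "trying reconnect" path means a changed
// DNS entry (or /etc/hosts line) is picked up on the next attempt.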



 Comments   
Comment by Eliot Horowitz (Inactive) [ 29/Dec/11 ]

ping does something different for name resolution.
I don't know the details off the top of my head, but I know it behaves differently whenever I use it.

Comment by Site Operations [ 29/Dec/11 ]

Not running pdnsd, but one correction: the system that was fubar'd WAS using mdns; it was set to resolve as "files mdns dns". The majority of our systems are set up without mdns (including all of the shards and config servers); only the app servers where the mongos was running use it.

I assume that means this is not a bug? I'm still curious why ping worked fine but Mongo didn't...

Comment by Scott Hernandez (Inactive) [ 29/Dec/11 ]

Are you running pdnsd?

Comment by Site Operations [ 28/Dec/11 ]

Not using mDNS / Avahi; we have primary and secondary DNS servers set up, with static IPs for all servers.

/etc/nsswitch.conf shows

hosts:          files dns

/etc/resolv.conf looks similar to this:

search example.com
nameserver 192.168.1.100
nameserver 192.168.1.101

Comment by Eliot Horowitz (Inactive) [ 28/Dec/11 ]

Are you using mdns?
Is mdns the first resolution step?

Comment by Site Operations [ 28/Dec/11 ]

It has to be "cached" somehow, since with no other changes restarting the mongos fixed the problem. No reboot was required. It also shouldn't be a TTL problem, since the entry was originally in /etc/hosts, which carries no explicit TTL, and because the problem had been occurring for several hours before it was fixed.

I was able to ping the config servers and the shard servers from the command line while the errors were still coming through, which is why this was quite difficult to diagnose and resolve.

From the code, your assumption is valid. It looks like the only place it might be caching is the actual connections in the WriteBackListener, where it calls init() against each host. Is there anything obvious in the section with "WriteBackListener exception" that might fail and cause a reconnect on the same connection, but without doing a resolve?
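
To make that concrete, a hypothetical sketch (not the actual mongos/WriteBackListener code) of the suspected failure mode: the host is resolved once in init() and the stored address is reused by every later reconnect, so a DNS change is never seen until the process restarts.

// Hypothetical illustration of the suspected failure mode; not the real code.
#include <netdb.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstring>
#include <string>

class CachedConnection {
public:
    // init() resolves the host ONCE and stores the first returned address.
    bool init(const std::string& host, const std::string& port) {
        addrinfo hints{}, *res = nullptr;
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(host.c_str(), port.c_str(), &hints, &res) != 0)
            return false;
        std::memcpy(&_addr, res->ai_addr, res->ai_addrlen);
        _addrlen = res->ai_addrlen;
        freeaddrinfo(res);
        return doConnect();
    }

    // reconnect() reuses the address cached by init(); if the name now
    // points elsewhere, this keeps dialing the stale IP, matching the
    // repeating "trying reconnect ... failed" lines in the log above.
    bool reconnect() {
        if (_fd >= 0) close(_fd);
        return doConnect();
    }

private:
    bool doConnect() {
        _fd = socket(_addr.ss_family, SOCK_STREAM, 0);
        if (_fd < 0) return false;
        if (connect(_fd, reinterpret_cast<sockaddr*>(&_addr), _addrlen) == 0)
            return true;
        close(_fd);
        _fd = -1;
        return false;
    }

    sockaddr_storage _addr{};
    socklen_t _addrlen = 0;
    int _fd = -1;
};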

Comment by Eliot Horowitz (Inactive) [ 28/Dec/11 ]

Can you check the resolution order and mdns?

mongo doesn't cache any DNS entries; it just uses the system resolver for all resolution.
