[SERVER-12477] IP address resolution for the replica set node doesn't work properly. Created: 25/Jan/14  Updated: 18/Feb/14  Resolved: 18/Feb/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.8
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Kubenko Assignee: Mark Benvenuto
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: Windows
Steps To Reproduce:

Turn off one replica set member and return it with different IP.

Participants:

 Description   

When a member of a replica set goes down and returns back with another IP address it can't find that node. I did nslookup for a name and it found it after that. Down below you can find the log
According to that log I did
nslookup MongoNode03
approximately at Fri Jan 24 23:36:46.995.

A bit more about how this happens: When in Azure datacenter I resize VM or VM is moved it get's another internal IP address. After that it appears with the new one and it can not be resolved.

Fri Jan 24 23:36:42.497 [rsHealthPoll] couldn't connect to MongoNode03:10001: couldn't connect to server MongoNode03:10001
Fri Jan 24 23:36:44.527 [rsHealthPoll] getaddrinfo("MongoNode03") failed: 000000014109CBD0
Fri Jan 24 23:36:44.527 [rsHealthPoll] couldn't connect to MongoNode03:10001: couldn't connect to server MongoNode03:10001
Fri Jan 24 23:36:44.871 [rsHealthPoll] getaddrinfo("MongoNode03") failed: 000000014109CBD0
Fri Jan 24 23:36:44.871 [rsHealthPoll] couldn't connect to MongoNode03:10001: couldn't connect to server MongoNode03:10001
Fri Jan 24 23:36:44.902 [rsHealthPoll] getaddrinfo("MongoNode03") failed: 000000014109CBD0
Fri Jan 24 23:36:44.902 [rsHealthPoll] replset info MongoNode03:10001 heartbeat failed, retrying
Fri Jan 24 23:36:44.933 [rsHealthPoll] getaddrinfo("MongoNode03") failed: 000000014109CBD0
Fri Jan 24 23:36:44.933 [rsHealthPoll] couldn't connect to MongoNode03:10001: couldn't connect to server MongoNode03:10001
Fri Jan 24 23:36:44.965 [rsHealthPoll] getaddrinfo("MongoNode03") failed: 000000014109CBD0
Fri Jan 24 23:36:44.965 [rsHealthPoll] couldn't connect to MongoNode03:10001: couldn't connect to server MongoNode03:10001
Fri Jan 24 23:36:44.996 [rsHealthPoll] getaddrinfo("MongoNode03") failed: 000000014109CBD0
Fri Jan 24 23:36:44.996 [rsHealthPoll] couldn't connect to MongoNode03:10001: couldn't connect to server MongoNode03:10001
Fri Jan 24 23:36:46.995 [rsHealthPoll] replSet member MongoNode03:10001 is up
Fri Jan 24 23:36:50.040 [initandlisten] connection accepted from 10.175.112.48:49158 #12558 (6 connections now open)
Fri Jan 24 23:36:50.040 [conn12558] authenticate db: local

{ authenticate: 1, nonce: "81d4c87bead86dce", user: "__system", key: "e0dc2d38615e9e8684f80a9a1262e82a" }

Fri Jan 24 23:36:50.040 [conn12558] end connection 10.175.112.48:49158 (5 connections now open)
Fri Jan 24 23:36:50.040 [initandlisten] connection accepted from 10.175.112.48:49159 #12559 (6 connections now open)
Fri Jan 24 23:36:50.040 [conn12559] authenticate db: local

{ authenticate: 1, nonce: "899da5d4bb84424b", user: "__system", key: "051477327e1d09958081c17367cf82ee" }

Fri Jan 24 23:36:50.992 [rsHealthPoll] replset info MongoNode03:10001 thinks that we are down
Fri Jan 24 23:36:50.992 [rsHealthPoll] replSet member MongoNode03:10001 is now in state STARTUP2
Fri Jan 24 23:37:06.265 [conn12559] end connection 10.175.112.48:49159 (5 connections now open)

Here's the details of the configuration where the problem appears:
Fri Jan 24 15:48:06.397 [initandlisten] git version: a350fc38922fbda2cec8d5dd842237b904eafc14
Fri Jan 24 15:48:06.397 [initandlisten] build info: windows sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_49



 Comments   
Comment by Kubenko [ 17/Feb/14 ]

I think we should close this as this issue has nothing to do with the mongo. It is configuration for the OS.
I don't have many transaction. It seems to me that it's not a big drama 15 minutes delay.

Comment by Kubenko [ 17/Feb/14 ]

Hi Mark,

Sorry for the delay. I don't have any of those values mentioned in the article in the VM registry.
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters]
"NegativeCacheTime"=dword:00000000
"NetFailureCacheTime"=dword:00000000
"NegativeSOACacheTime"=dword:00000000

Today I did a test. I provisioned VM. It started at 23.08 with the new ip.
Here're the details
PRJNAMEReplSet0:PRIMARY> rs.status()
{
"set" : "PRJNAMEReplSet0",
"date" : ISODate("2014-02-17T23:18:25Z"),
"myState" : 1,
"members" : [

{ "_id" : 0, "name" : "PRJNAME02:10001", "health" : 1, "state" : 1, "stateStr" : "PRIMARY", "uptime" : 421566, "optime" : Timestamp(1392641223, 1), "optimeDate" : ISODate("2014-02-17T12:47:03Z"), "self" : true }

,

{ "_id" : 1, "name" : "PRJNAME03:10001", "health" : 0, "state" : 8, "stateStr" : "(not reachable/healthy)", "uptime" : 0, "optime" : Timestamp(1392641223, 1), "optimeDate" : ISODate("2014-02-17T12:47:03Z"), "lastHeartbeat" : ISODate("2014-02-17T23:18:25Z"), "lastHeartbeatRecv" : ISODate("2014-02-17T23:05:41Z"), "pingMs" : 0, "syncingTo" : "PRJNAME02:10001" }

,

{ "_id" : 2, "name" : "PRJNAME04hidden:10001", "health" : 1, "state" : 2, "stateStr" : "SECONDARY", "uptime" : 6598, "optime" : Timestamp(1392641223, 1), "optimeDate" : ISODate("2014-02-17T12:47:03Z"), "lastHeartbeat" : ISODate("2014-02-17T23:18:23Z"), "lastHeartbeatRecv" : ISODate("2014-02-17T23:18:23Z"), "pingMs" : 1, "syncingTo" : "PRJNAME02:10001" }

],
"ok" : 1
}
PRJNAMEReplSet0:PRIMARY> rs.status()
{
"set" : "PRJNAMEReplSet0",
"date" : ISODate("2014-02-17T23:22:20Z"),
"myState" : 1,
"members" : [

{ "_id" : 0, "name" : "PRJNAME02:10001", "health" : 1, "state" : 1, "stateStr" : "PRIMARY", "uptime" : 421801, "optime" : Timestamp(1392641223, 1), "optimeDate" : ISODate("2014-02-17T12:47:03Z"), "self" : true }

,

{ "_id" : 1, "name" : "PRJNAME03:10001", "health" : 1, "state" : 5, "stateStr" : "STARTUP2", "uptime" : 87, "optime" : Timestamp(0, 0), "optimeDate" : ISODate("1970-01-01T00:00:00Z"), "lastHeartbeat" : ISODate("2014-02-17T23:22:19Z"), "lastHeartbeatRecv" : ISODate("2014-02-17T23:22:18Z"), "pingMs" : 0, "lastHeartbeatMessage" : "initial sync cloning db: pdb", "syncingTo" : "PRJNAME02:10001" }

,

{ "_id" : 2, "name" : "PRJNAME04hidden:10001", "health" : 1, "state" : 2, "stateStr" : "SECONDARY", "uptime" : 6833, "optime" : Timestamp(1392641223, 1), "optimeDate" : ISODate("2014-02-17T12:47:03Z"), "lastHeartbeat" : ISODate("2014-02-17T23:22:19Z"), "lastHeartbeatRecv" : ISODate("2014-02-17T23:22:19Z"), "pingMs" : 1, "syncingTo" : "PRJNAME02:10001" }

],
"ok" : 1
}
PRJNAMEReplSet0:PRIMARY>

at 2014-02-17T23:18:25Z it was not available but at 2014-02-17T23:22:19Z it is starting looks like default delay is 10 or 15 minutes.
I'm thinking may be it is ok to leave it as is.

Comment by Mark Benvenuto [ 28/Jan/14 ]

Do you have negative DNS lookup caching enabled on your Windows VMs? Unfortunately, due to a bug in 2.4.8 the error message is unclear as to the exact cause : "000000014109CBD0". I expect that when you changed the configuration of your Azure tenant, the host lookup failed on the DNS server, was cached by the local machine, and was not fixed until you did a nslookup.

See http://adminfoo.net/2004/10/dont-cache-negative-dns-lookups-on.html

Generated at Thu Feb 08 03:28:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.