[SERVER-30227] NetworkInterfaceASIO-Replication-0 Tries to connect to non-existent host. Created: 19/Jul/17  Updated: 06/Oct/21  Resolved: 07/Sep/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Konstantinos Bessas Assignee: Mark Agarunov
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongodb.node1.log     Text File mongodb.node2.log     Text File mongodb.node3.log    
Operating System: ALL
Participants:

 Description   

I am running a MongoDB replica set with 3 nodes.

The cluster has authentication enabled and uses self-signed SSL certificates. SSL mode is set to 'preferred'.

The version of MongoDB is 3.2.14, and it is the same for all the nodes.

The output of `rs.status()` shows that the cluster is working properly: there is one primary node and two secondary nodes, and the secondaries are syncing from existing members.

mongoreplicaset:SECONDARY> rs.status()
{
	"set" : "mongoreplicaset",
	"date" : ISODate("2017-07-19T15:02:42.402Z"),
	"myState" : 2,
	"term" : NumberLong(32),
	"syncingTo" : "10.0.16.5:27017",
	"heartbeatIntervalMillis" : NumberLong(2000),
	"members" : [
		{
			"_id" : 5,
			"name" : "10.0.16.5:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 606374,
			"optime" : {
				"ts" : Timestamp(1500476413, 2),
				"t" : NumberLong(32)
			},
			"optimeDate" : ISODate("2017-07-19T15:00:13Z"),
			"lastHeartbeat" : ISODate("2017-07-19T15:02:41.235Z"),
			"lastHeartbeatRecv" : ISODate("2017-07-19T15:02:40.460Z"),
			"pingMs" : NumberLong(0),
			"electionTime" : Timestamp(1499870258, 1),
			"electionDate" : ISODate("2017-07-12T14:37:38Z"),
			"configVersion" : 2932125
		},
		{
			"_id" : 101,
			"name" : "10.0.16.101:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 606595,
			"optime" : {
				"ts" : Timestamp(1500476413, 2),
				"t" : NumberLong(32)
			},
			"optimeDate" : ISODate("2017-07-19T15:00:13Z"),
			"syncingTo" : "10.0.16.5:27017",
			"configVersion" : 2932125,
			"self" : true
		},
		{
			"_id" : 144,
			"name" : "10.0.16.144:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 606233,
			"optime" : {
				"ts" : Timestamp(1500476413, 2),
				"t" : NumberLong(32)
			},
			"optimeDate" : ISODate("2017-07-19T15:00:13Z"),
			"lastHeartbeat" : ISODate("2017-07-19T15:02:41.335Z"),
			"lastHeartbeatRecv" : ISODate("2017-07-19T15:02:40.693Z"),
			"pingMs" : NumberLong(0),
			"syncingTo" : "10.0.16.101:27017",
			"configVersion" : 2932125
		}
	],
	"ok" : 1
}

Yet, every 30 seconds, the logs show the following error:

2017-07-19T17:01:33.459+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to 10.0.16.155:27017 - ExceededTimeLimit: Operation timed out
2017-07-19T17:01:33.459+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to 10.0.16.155:27017

This address is not defined anywhere.

We face the same issue with every MongoDB replica set we run. The address the nodes try to connect to is different each time, but the error is the same.
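
For reference, the hosts in the active configuration can be listed directly to confirm that this address is not present (illustrative snippet; the output shown is what we would expect given the `rs.status()` above):

mongoreplicaset:SECONDARY> rs.conf().members.map(function (m) { return m.host; })
[ "10.0.16.5:27017", "10.0.16.101:27017", "10.0.16.144:27017" ]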



 Comments   
Comment by Donald Duck [ 06/Oct/21 ]

We have the same problem, I guess.

 

2021-10-06T11:43:31.002+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to 172.31.6.61:27017 - NetworkInterfaceExceededTimeLimit: Operation timed out
2021-10-06T11:43:31.002+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.31.6.61:27017
2021-10-06T11:43:51.002+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to 172.31.6.61:27017 - NetworkInterfaceExceededTimeLimit: Operation timed out
2021-10-06T11:43:51.002+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.31.6.61:27017
2021-10-06T11:44:11.003+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to 172.31.6.61:27017 - NetworkInterfaceExceededTimeLimit: Operation timed out
2021-10-06T11:44:11.003+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.31.6.61:27017
2021-10-06T11:44:31.003+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to 172.31.6.61:27017 - NetworkInterfaceExceededTimeLimit: Operation timed out
2021-10-06T11:44:31.003+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.31.6.61:27017
2021-10-06T11:44:51.003+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to 172.31.6.61:27017 - NetworkInterfaceExceededTimeLimit: Operation timed out

 

This keeps occurring; different nodes in the cluster try to connect to different IPs, though.

Comment by Mark Agarunov [ 07/Sep/17 ]

Hello kbessas.sentia,

Thank you for the additional information. Taking a closer look at the logs, I see that an external client is connecting and issuing replSetReconfig commands, which end up configuring the replica set with these incorrect hosts. This does not appear to come from anything internal to MongoDB, but from an external connection. An example from the logs:

2017-08-02T17:28:27.894+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:51404 #118 (2 connections now open)
2017-08-02T17:28:27.894+0200 I NETWORK  [conn118] SSL mode is set to 'preferred' and connection 118 to 127.0.0.1:51404 is not using SSL.
2017-08-02T17:28:27.909+0200 I ACCESS   [conn118] Successfully authenticated as principal admin on admin
2017-08-02T17:28:27.912+0200 I REPL     [conn118] replSetReconfig admin command received from client
2017-08-02T17:28:27.913+0200 I REPL     [conn118] replSetReconfig config object with 1 members parses ok
2017-08-02T17:28:27.913+0200 I REPL     [replExecDBWorker-2] New replica set config in use: { _id: "replica0", version: 5525020, protocolVersion: 1, members: [ { _id: 0, host: "localhost:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('59819b21ed73b34506d23010') } }
2017-08-02T17:28:27.913+0200 I REPL     [replExecDBWorker-2] This node is localhost:27017 in the config
2017-08-02T17:28:27.913+0200 I REPL     [replExecDBWorker-2] Canceling priority takeover callback
2017-08-02T17:28:27.913+0200 I REPL     [replExecDBWorker-2] conducting a dry run election to see if we could be elected
2017-08-02T17:28:27.913+0200 I REPL     [replExecDBWorker-2] dry election run succeeded, running for election
2017-08-02T17:28:27.915+0200 I REPL     [conn118] replSetReconfig admin command received from client
2017-08-02T17:28:27.915+0200 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 80
2017-08-02T17:28:27.915+0200 I REPL     [ReplicationExecutor] transition to PRIMARY
2017-08-02T17:28:27.923+0200 W NETWORK  [conn118] The server certificate does not match the host name 10.1.16.72
2017-08-02T17:28:27.928+0200 W NETWORK  [conn118] couldn't check isSelf (10.1.16.176:27017) socket exception [CLOSED] for 10.1.16.176:27017
2017-08-02T17:28:27.928+0200 I REPL     [conn118] replSetReconfig config object with 3 members parses ok
2017-08-02T17:28:27.929+0200 I REPL     [replExecDBWorker-1] New replica set config in use: { _id: "replica0", version: 5600818, protocolVersion: 1, members: [ { _id: 48, host: "10.1.16.48:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 999.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 72, host: "10.1.16.72:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 498.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 176, host: "10.1.16.176:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 248.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('59819b21ed73b34506d23010') } }
2017-08-02T17:28:27.929+0200 I REPL     [replExecDBWorker-1] This node is 10.1.16.48:27017 in the config
2017-08-02T17:28:27.929+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to 10.1.16.176:27017
2017-08-02T17:28:27.930+0200 I REPL     [ReplicationExecutor] Member 10.1.16.72:27017 is now in state PRIMARY
2017-08-02T17:28:27.930+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to 10.1.16.176:27017 - HostUnreachable: short read
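
For illustration only (a sketch, not a claim about your tooling): a client-issued reconfig along these lines would produce the one-member "localhost" config shown in the log above:

cfg = rs.conf()
// collapse the member list to a single localhost entry (illustrative values)
cfg.members = [ { _id: 0, host: "localhost:27017" } ]
// a forced reconfig bumps the config version by a random amount, which may
// also explain the large, jumping configVersion values in this ticket
rs.reconfig(cfg, { force: true })

If you can identify the tool or script that opens these connections (conn118 above), that should point at the source of the bad host entries.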

Thanks,
Mark

Comment by Konstantinos Bessas [ 01/Sep/17 ]

Hello Mark,

I can confirm that we are still facing this issue, and we see it in all of our replica sets (15+ clusters).

In order to assist you, here is some more information about our setup and how it works:

Our MongoDB clusters are set up in AWS. We use a base AMI with MongoDB installed, plus scripts that maintain the cluster information. Instances can be terminated, but the data, journal, and log volumes are persistent. On boot, an instance mounts the volumes, starts MongoDB, and automatically reconfigures the replica set so that each node knows the IP addresses of the other nodes (see the sketch below).
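
Roughly, the boot-time reconfiguration amounts to something like the following (a simplified sketch, not our actual script; lookupCurrentIp is a hypothetical placeholder for our cluster-metadata lookup):

cfg = rs.conf()
cfg.members.forEach(function (m) {
    // rewrite each member's host to the instance's current private IP;
    // lookupCurrentIp() stands in for however the script resolves it
    m.host = lookupCurrentIp(m._id) + ":27017"
})
rs.reconfig(cfg)   // with { force: true } when no primary is reachable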

The problems start when we terminate one or more instances several times. After a few terminations, these error messages appear in the logs even though the cluster is correctly configured.

Comment by Mark Agarunov [ 30/Aug/17 ]

Hello kbessas.sentia,

Thank you for providing this information. Unfortunately, I haven't been able to reproduce this behavior yet. Are you still experiencing it? If so, does it persist when the nodes are restarted?

Thanks,
Mark

Comment by Konstantinos Bessas [ 03/Aug/17 ]

Hello,

Please find attached the log files of the 3 nodes.

Here is a snippet of the errors from the logs:

2017-08-03T12:38:11.260+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to 10.1.16.176:27017 - ExceededTimeLimit: Operation timed out
2017-08-03T12:38:11.260+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to 10.1.16.176:27017
2017-08-03T12:38:12.189+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to 10.1.16.184:27017 - ExceededTimeLimit: Operation timed out
2017-08-03T12:38:12.189+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to 10.1.16.184:27017
2017-08-03T12:38:31.260+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to 10.1.16.176:27017 - ExceededTimeLimit: Operation timed out
2017-08-03T12:38:31.260+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to 10.1.16.176:27017
2017-08-03T12:38:32.189+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to 10.1.16.184:27017 - ExceededTimeLimit: Operation timed out
2017-08-03T12:38:32.189+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to 10.1.16.184:27017

At the same time, here is the output of `rs.status()`:

replica0:PRIMARY> rs.status()
{
	"set" : "replica0",
	"date" : ISODate("2017-08-03T10:36:29.857Z"),
	"myState" : 1,
	"term" : NumberLong(116),
	"heartbeatIntervalMillis" : NumberLong(2000),
	"members" : [
		{
			"_id" : 51,
			"name" : "10.1.16.51:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 9610,
			"optime" : {
				"ts" : Timestamp(1501751308, 1),
				"t" : NumberLong(116)
			},
			"optimeDate" : ISODate("2017-08-03T09:08:28Z"),
			"electionTime" : Timestamp(1501751307, 1),
			"electionDate" : ISODate("2017-08-03T09:08:27Z"),
			"configVersion" : 8575183,
			"self" : true
		},
		{
			"_id" : 79,
			"name" : "10.1.16.79:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 5282,
			"optime" : {
				"ts" : Timestamp(1501751308, 1),
				"t" : NumberLong(116)
			},
			"optimeDate" : ISODate("2017-08-03T09:08:28Z"),
			"lastHeartbeat" : ISODate("2017-08-03T10:36:27.896Z"),
			"lastHeartbeatRecv" : ISODate("2017-08-03T10:36:25.284Z"),
			"pingMs" : NumberLong(0),
			"configVersion" : 8575183
		},
		{
			"_id" : 144,
			"name" : "10.1.16.144:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 5232,
			"optime" : {
				"ts" : Timestamp(1501751308, 1),
				"t" : NumberLong(116)
			},
			"optimeDate" : ISODate("2017-08-03T09:08:28Z"),
			"lastHeartbeat" : ISODate("2017-08-03T10:36:27.894Z"),
			"lastHeartbeatRecv" : ISODate("2017-08-03T10:36:28.623Z"),
			"pingMs" : NumberLong(0),
			"configVersion" : 8575183
		}
	],
	"ok" : 1
}

and the output of `rs.conf()`:

replica0:PRIMARY> rs.conf()
{
	"_id" : "replica0",
	"version" : 8575183,
	"protocolVersion" : NumberLong(1),
	"members" : [
		{
			"_id" : 51,
			"host" : "10.1.16.51:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 999,
			"tags" : {
 
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 79,
			"host" : "10.1.16.79:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 498,
			"tags" : {
 
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 144,
			"host" : "10.1.16.144:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 248,
			"tags" : {
 
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		}
	],
	"settings" : {
		"chainingAllowed" : true,
		"heartbeatIntervalMillis" : 2000,
		"heartbeatTimeoutSecs" : 10,
		"electionTimeoutMillis" : 10000,
		"getLastErrorModes" : {
 
		},
		"getLastErrorDefaults" : {
			"w" : 1,
			"wtimeout" : 0
		},
		"replicaSetId" : ObjectId("59819b21ed73b34506d23010")
	}
}

Comment by Mark Agarunov [ 27/Jul/17 ]

Hello kbessas.sentia,

Thank you for reporting this behavior. To get a better idea of what might be causing this and why mongod is trying to sync from invalid nodes, please provide the complete log files from all mongod nodes affected.

Thanks,
Mark

Comment by Konstantinos Bessas [ 19/Jul/17 ]

Digging into the logs, we have found some more information.

Apparently the node with IP 10.0.16.155 was a valid member of this cluster (in fact, the primary node) until an update triggered a replacement of all the nodes. Although the nodes were replaced, the secondaries somehow still retain information about this node (cached?).

We performed one extra test and stepped down the primary node. While the former primary was a secondary, the same error appeared in its logs, but this time it was trying to connect to a node with IP 10.0.16.105. That node was also part of the replica set long ago (but is not anymore).

It seems that when nodes are replaced while the data, journal, and log volumes are preserved, old member IPs remain cached somewhere within the nodes, despite the replica set being reconfigured with the correct IPs.
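
One place worth checking (an assumption on our side, not something confirmed in this ticket): the replica set configuration is persisted in the local database on the data volume, so a preserved volume carries the old member list with it until a reconfig overwrites it. It can be inspected directly:

// show the on-disk replica set config carried over with the data volume
db.getSiblingDB("local").system.replset.find().pretty()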
