[SERVER-20188] heartbeat attempts spin when a node is down Created: 28/Aug/15  Updated: 15/Nov/21  Resolved: 22/Sep/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Eric Milkie Assignee: Matt Dannenberg
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

Start a 3-node replica set and then terminate one or more nodes.

Sprint: RPL A (10/09/15)
Participants:

 Description   

e.g.

2015-08-28T15:10:34.304-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to lazarus:27017; HostUnreachable couldn't connect to server lazarus:27017, connection attempt failed
2015-08-28T15:10:34.304-0400 W NETWORK  [ReplExecNetThread-7] Failed to connect to 127.0.0.1:27019, reason: errno:111 Connection refused
2015-08-28T15:10:34.304-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to lazarus:27019; HostUnreachable couldn't connect to server lazarus:27019, connection attempt failed
2015-08-28T15:10:34.314-0400 W NETWORK  [ReplExecNetThread-9] Failed to connect to 127.0.0.1:27017, reason: errno:111 Connection refused
2015-08-28T15:10:34.314-0400 W NETWORK  [ReplExecNetThread-7] Failed to connect to 127.0.0.1:27019, reason: errno:111 Connection refused
2015-08-28T15:10:34.314-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to lazarus:27017; HostUnreachable couldn't connect to server lazarus:27017, connection attempt failed
2015-08-28T15:10:34.315-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to lazarus:27019; HostUnreachable couldn't connect to server lazarus:27019, connection attempt failed
2015-08-28T15:10:34.315-0400 W NETWORK  [ReplExecNetThread-9] Failed to connect to 127.0.0.1:27017, reason: errno:111 Connection refused
2015-08-28T15:10:34.315-0400 W NETWORK  [ReplExecNetThread-7] Failed to connect to 127.0.0.1:27019, reason: errno:111 Connection refused
2015-08-28T15:10:34.315-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to lazarus:27017; HostUnreachable couldn't connect to server lazarus:27017, connection attempt failed
2015-08-28T15:10:34.315-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to lazarus:27019; HostUnreachable couldn't connect to server lazarus:27019, connection attempt failed
2015-08-28T15:10:34.315-0400 W NETWORK  [ReplExecNetThread-9] Failed to connect to 127.0.0.1:27017, reason: errno:111 Connection refused
2015-08-28T15:10:34.315-0400 W NETWORK  [ReplExecNetThread-7] Failed to connect to 127.0.0.1:27019, reason: errno:111 Connection refused
2015-08-28T15:10:34.316-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to lazarus:27017; HostUnreachable couldn't connect to server lazarus:27017, connection attempt failed
2015-08-28T15:10:34.316-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to lazarus:27019; HostUnreachable couldn't connect to server lazarus:27019, connection attempt failed
2015-08-28T15:10:34.326-0400 W NETWORK  [ReplExecNetThread-9] Failed to connect to 127.0.0.1:27017, reason: errno:111 Connection refused
2015-08-28T15:10:34.326-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to lazarus:27017; HostUnreachable couldn't connect to server lazarus:27017, connection attempt failed
2015-08-28T15:10:34.326-0400 W NETWORK  [ReplExecNetThread-7] Failed to connect to 127.0.0.1:27019, reason: errno:111 Connection refused
2015-08-28T15:10:34.327-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to lazarus:27019; HostUnreachable couldn't connect to server lazarus:27019, connection attempt failed
2015-08-28T15:10:34.327-0400 W NETWORK  [ReplExecNetThread-9] Failed to connect to 127.0.0.1:27017, reason: errno:111 Connection refused
2015-08-28T15:10:34.327-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to lazarus:27017; HostUnreachable couldn't connect to server lazarus:27017, connection attempt failed
2015-08-28T15:10:34.327-0400 W NETWORK  [ReplExecNetThread-7] Failed to connect to 127.0.0.1:27019, reason: errno:111 Connection refused
2015-08-28T15:10:34.327-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to lazarus:27019; HostUnreachable couldn't connect to server lazarus:27019, connection attempt failed
2015-08-28T15:10:34.328-0400 W NETWORK  [ReplExecNetThread-9] Failed to connect to 127.0.0.1:27017, reason: errno:111 Connection refused
2015-08-28T15:10:34.328-0400 W NETWORK  [ReplExecNetThread-7] Failed to connect to 127.0.0.1:27019, reason: errno:111 Connection refused



 Comments   
Comment by Victor Stewart [ 15/Jun/16 ]

I'm having a similar issue right now. I've trying everything from deleting all the local databases on each of the 3, to uninstalling and reinstalling mongo.. nothing seems to work. Every time I boot them back up, they fall down this hole of one of the Hosts being unreachable.. and it's the same server every time. Can't seem to figure it out... its okay after the first 2 servers are added into the replica set, then the 3rd adds fine.. then everything descends into chaos:

[ReplicationExecutor] Error in heartbeat request to request:27017; HostUnreachable: No route to host

Comment by Matt Dannenberg [ 22/Sep/15 ]

That is true only when a node lacks a sync source.

Comment by Eric Milkie [ 22/Sep/15 ]

If you set the electionTimeout to something like .5 seconds, the log volume can be overwhelming.

Comment by Matt Dannenberg [ 22/Sep/15 ]

Currently the nodes use the heartbeat interval which defaults to 2 seconds. Log evidence:

 
2015-09-22T05:39:30.285-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:30.286-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:30.286-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:32.286-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:32.287-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:32.288-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:34.287-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:34.288-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:34.288-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:36.288-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:36.289-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:36.290-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:38.291-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:38.292-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:38.292-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:40.292-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:40.293-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:40.293-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused

After reconfigging to have a heartbeatIntervalMillis of 60000:

2015-09-22T05:39:40.864-0400 I REPL     [ReplicationExecutor] This node is dannenstation.local:27018 in the config
2015-09-22T05:39:40.864-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:40.865-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:39:40.865-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:40:40.865-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:40:40.866-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:40:40.867-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:41:40.868-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:41:40.869-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused
2015-09-22T05:41:40.869-0400 I REPL     [ReplicationExecutor] Error in heartbeat request to dannenstation.local:27019; HostUnreachable Connection refused

Comment by Eric Milkie [ 21/Sep/15 ]

I set up a three node set and then shut one of the nodes down.

Comment by Eric Milkie [ 21/Sep/15 ]

pv 1

Comment by Matt Dannenberg [ 21/Sep/15 ]

Do you remember which protocolVersion exhibited this behavior?

Comment by Eric Milkie [ 21/Sep/15 ]

Need to investigate to see if this is still an issue.

Generated at Thu Feb 08 03:53:27 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.