[SERVER-20188] heartbeat attempts spin when a node is down Created: 28/Aug/15 Updated: 15/Nov/21 Resolved: 22/Sep/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Eric Milkie | Assignee: | Matt Dannenberg |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Steps To Reproduce: | Start a 3-node replica set and then terminate one or more nodes. |
| Sprint: | RPL A (10/09/15) |
| Participants: |
| Description |
|
e.g.
|
| Comments |
| Comment by Victor Stewart [ 15/Jun/16 ] | |||||||||||||||||||||||||||||
|
I'm having a similar issue right now. I've trying everything from deleting all the local databases on each of the 3, to uninstalling and reinstalling mongo.. nothing seems to work. Every time I boot them back up, they fall down this hole of one of the Hosts being unreachable.. and it's the same server every time. Can't seem to figure it out... its okay after the first 2 servers are added into the replica set, then the 3rd adds fine.. then everything descends into chaos: [ReplicationExecutor] Error in heartbeat request to request:27017; HostUnreachable: No route to host | |||||||||||||||||||||||||||||
| Comment by Matt Dannenberg [ 22/Sep/15 ] | |||||||||||||||||||||||||||||
|
That is true only when a node lacks a sync source. | |||||||||||||||||||||||||||||
| Comment by Eric Milkie [ 22/Sep/15 ] | |||||||||||||||||||||||||||||
|
If you set the electionTimeout to something like .5 seconds, the log volume can be overwhelming. | |||||||||||||||||||||||||||||
| Comment by Matt Dannenberg [ 22/Sep/15 ] | |||||||||||||||||||||||||||||
|
Currently the nodes use the heartbeat interval which defaults to 2 seconds. Log evidence:
After reconfigging to have a heartbeatIntervalMillis of 60000:
| |||||||||||||||||||||||||||||
| Comment by Eric Milkie [ 21/Sep/15 ] | |||||||||||||||||||||||||||||
|
I set up a three node set and then shut one of the nodes down. | |||||||||||||||||||||||||||||
| Comment by Eric Milkie [ 21/Sep/15 ] | |||||||||||||||||||||||||||||
|
pv 1 | |||||||||||||||||||||||||||||
| Comment by Matt Dannenberg [ 21/Sep/15 ] | |||||||||||||||||||||||||||||
|
Do you remember which protocolVersion exhibited this behavior? | |||||||||||||||||||||||||||||
| Comment by Eric Milkie [ 21/Sep/15 ] | |||||||||||||||||||||||||||||
|
Need to investigate to see if this is still an issue. |