[SERVER-28593] Separate notion of 'down' and 'unknown' in heartbeat liveness monitoring Created: 03/Apr/17 Updated: 06/Dec/22 Resolved: 04/Apr/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Spencer Brody (Inactive) | Assignee: | Backlog - Replication Team |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: | Replication |
| Participants: |
| Description |
|
When a node first starts up, it has no idea of the state of the other nodes in the replica set. Currently it defaults to assuming all nodes are 'down' until it gets a heartbeat proving otherwise. This can sometimes cause nodes to call for elections unnecessarily. Instead, we should consider nodes to be in state 'unknown' until the heartbeat timeout passes without hearing from them, and not call for an election so long as any nodes are still in state 'unknown'. |
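The proposal above can be sketched roughly as follows. This is an illustrative Python model only, not the server's implementation: `Liveness`, `Member`, `liveness`, and `may_call_election` are hypothetical names, timestamps are plain seconds, and real election decisions depend on much more than this single check.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class Liveness(Enum):
    UNKNOWN = "unknown"  # no heartbeat yet, and the timeout has not expired
    DOWN = "down"        # heartbeat timeout passed without hearing from the node
    UP = "up"            # a heartbeat arrived within the timeout window


@dataclass
class Member:
    # Time (seconds) of the last heartbeat response, or None if none has arrived.
    last_heartbeat: Optional[float] = None


def liveness(member: Member, started_at: float, now: float,
             heartbeat_timeout: float = 10.0) -> Liveness:
    """Classify a member; before any heartbeat, the clock starts at startup."""
    reference = started_at if member.last_heartbeat is None else member.last_heartbeat
    if now - reference >= heartbeat_timeout:
        return Liveness.DOWN
    return Liveness.UNKNOWN if member.last_heartbeat is None else Liveness.UP


def may_call_election(members: Iterable[Member], started_at: float, now: float,
                      heartbeat_timeout: float = 10.0) -> bool:
    """Hypothetical gate: hold off while any member is still 'unknown'."""
    return all(
        liveness(m, started_at, now, heartbeat_timeout) is not Liveness.UNKNOWN
        for m in members
    )
```

In this model a freshly started node with one silent peer would refuse to call an election until the heartbeat timeout has elapsed since startup, at which point the silent peer becomes 'down' and the gate opens.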
| Comments |
| Comment by Spencer Brody (Inactive) [ 04/Apr/17 ] |
|
milkie made a good point: the heartbeat timeout is generally less than or equal to the election timeout (currently they both default to 10 seconds), so there shouldn't really be a case where we call an election at startup while still waiting for a heartbeat to time out. So I feel like the cases we've seen in build failures that we thought were caused by this may have something else going on. |
| Comment by Siyuan Zhou [ 04/Apr/17 ] |
|
The health field of MemberHeartbeatData is already an integer: -1 means unknown, 0 means down, and 1 means up. So this ticket is more about not calling for an election as long as any nodes are still unknown. |
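As a rough illustration of the encoding described in this comment, the three integer values can be modelled as below. This is a Python sketch, not the server's C++ code: only the values -1, 0, and 1 come from the comment, while `Health` and `any_unknown` are hypothetical names introduced here.

```python
from enum import IntEnum
from typing import Iterable


class Health(IntEnum):
    # Integer values as described in the comment above.
    UNKNOWN = -1
    DOWN = 0
    UP = 1


def any_unknown(health_values: Iterable[int]) -> bool:
    """Return True if at least one member's health is still unknown.

    Health(h) raises ValueError for any integer outside -1, 0, 1, so an
    unexpected value is reported rather than silently treated as 'down'.
    """
    return any(Health(h) is Health.UNKNOWN for h in health_values)
```

With this encoding, the remaining change the ticket asks for reduces to checking something like `any_unknown(...)` before calling for an election.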