[SERVER-14214] Leader election protocol is not resilient to message omissions Created: 09/Jun/14 Updated: 06/Dec/22 Resolved: 23/Nov/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Davide Italiano | Assignee: | Backlog - Replication Team |
| Resolution: | Done | Votes: | 0 |
| Labels: | 28qa, elections | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Assigned Teams: | Replication |
| Operating System: | ALL | ||||||||
| Participants: | |||||||||
| Description |
|
This results in the liveness property of leader election being lost, i.e. a new master is never elected. A relatively easy way to trigger it: 2) Once a primary is elected, drop all incoming local connections directed to it. Assuming the primary is listening on 30001, this should be enough (on Linux, or whatever flavour of *NIX supports iptables).
Secondaries still receive heartbeats from the primary, so they don't change, as the log says.
|
| Comments |
| Comment by Spencer Brody (Inactive) [ 20/Dec/16 ] |
|
According to milkie's comment on |
| Comment by Ion Caliman [ 20/Dec/16 ] |
|
No longer since when? Which release has the fix? |
| Comment by Spencer Brody (Inactive) [ 23/Nov/16 ] |
|
We no longer consider a node up if we receive heartbeats from it but cannot heartbeat it ourselves. |
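The rule in the comment above can be illustrated with a small model. This is only a sketch, not the server's implementation: `PeerView`, `peer_is_up_old`, `peer_is_up_new`, and `should_seek_election` are hypothetical names, and the "old" rule is an assumption inferred from the description (a node counted as up as long as heartbeats arrived from it).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PeerView:
    """What one node knows about a single peer (hypothetical model)."""
    heard_from_peer: bool  # a heartbeat from the peer arrived recently
    can_reach_peer: bool   # our own heartbeat to the peer succeeded


def peer_is_up_old(view: PeerView) -> bool:
    # Assumed pre-fix rule: traffic in either direction marks the peer as up.
    return view.heard_from_peer or view.can_reach_peer


def peer_is_up_new(view: PeerView) -> bool:
    # Rule from the comment: receiving heartbeats is not enough;
    # we must be able to heartbeat the peer ourselves.
    return view.can_reach_peer


def should_seek_election(primary: PeerView,
                         is_up: Callable[[PeerView], bool]) -> bool:
    # A secondary only looks for a new primary once it considers
    # the current primary down.
    return not is_up(primary)


# Scenario from the description: incoming connections to the primary are
# dropped, but the primary's outgoing heartbeats still reach the secondary.
primary_seen_by_secondary = PeerView(heard_from_peer=True, can_reach_peer=False)
```

With `primary_seen_by_secondary`, the assumed old rule keeps the primary marked as up, so no election starts. The new rule marks it down, which lets the secondaries elect a replacement.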
| Comment by Eric Milkie [ 16/Jul/14 ] |
|
We'll re-examine this after the election enhancements are complete. |
| Comment by Eric Milkie [ 09/Jun/14 ] |
|
"Operation still in progress" is an odd error to get. I think it means that the prior attempt to connect hasn't yet cleared out. This should get smoother once we have a better story for connecting using nonblocking sockets. |