[SERVER-14214] Leader election protocol is not resilient to message omissions Created: 09/Jun/14  Updated: 06/Dec/22  Resolved: 23/Nov/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Davide Italiano Assignee: Backlog - Replication Team
Resolution: Done Votes: 0
Labels: 28qa, elections
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

This results in the liveness property of leader election being lost, i.e. a new master is never elected.

Relatively easy way to trigger:
1) Spawn a replica set locally. replicate.py in https://github.com/dcci/mongo-replication-perf can be used for this.

2) Once a primary is elected, drop all incoming local connections directed to it. Assuming the primary is listening on port 30001, the following should be enough (on Linux, or any *NIX flavour that supports iptables).

# iptables -A INPUT -j DROP -p tcp -i lo --destination-port 30001
# iptables -A INPUT -j DROP -p tcp --destination-port 30001
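Before watching the secondaries, it can help to confirm that the rules actually block inbound connections to the primary. The following is a minimal sketch, not part of the original report: the helper name can_connect and the 5-second default timeout are assumptions chosen for illustration, and it uses only the Python standard library.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    Hypothetical helper for this reproduction. With the DROP rules in place
    the SYN packets are silently discarded, so the attempt is expected to
    time out and this function returns False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return False


if __name__ == "__main__":
    # 30001 is the primary's port assumed in the steps above.
    print("primary reachable:", can_connect("127.0.0.1", 30001))
```

If this prints True after the rules were added, the rules are probably not matching the traffic (for example, a different port or interface).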

Secondaries still receive heartbeats from the primary, so they do not change their view of its state, as the log shows.

2014-06-09T11:45:47.792-0700 [rsHealthPoll] warning: Failed to connect to 127.0.0.1:30001 after 5000 milliseconds, giving up.
2014-06-09T11:45:47.792-0700 [rsHealthPoll] replset info localhost:30001 heartbeat failed, retrying
2014-06-09T11:45:50.698-0700 [rsBackgroundSync] replSet not trying to sync from localhost:30001, it is vetoed for 5 more seconds
2014-06-09T11:45:50.698-0700 [rsBackgroundSync] replSet not trying to sync from localhost:30001, it is vetoed for 5 more seconds
2014-06-09T11:45:52.793-0700 [rsHealthPoll] warning: Failed to connect to 127.0.0.1:30001, reason: errno:115 Operation now in progress
2014-06-09T11:45:52.793-0700 [rsHealthPoll] replset info localhost:30001 just heartbeated us, but our heartbeat failed: , not changing state
2014-06-09T11:45:55.698-0700 [rsBackgroundSync] replSet not trying to sync from localhost:30001, it is vetoed for 0 more seconds
2014-06-09T11:45:55.698-0700 [rsBackgroundSync] replSet not trying to sync from localhost:30001, it is vetoed for 0 more seconds
2014-06-09T11:45:59.069-0700 [conn46] end connection 127.0.0.1:52528 (1 connection now open)
2014-06-09T11:45:59.069-0700 [initandlisten] connection accepted from 127.0.0.1:52545 #48 (2 connections now open)
2014-06-09T11:45:59.835-0700 [rsHealthPoll] warning: Failed to connect to 127.0.0.1:30001 after 5000 milliseconds, giving up.
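
For context on the "errno:115" line in the log above: on Linux, errno 115 is EINPROGRESS, the code a non-blocking connect() returns when the connection cannot be completed immediately. The sketch below only illustrates that system-level behaviour with the Python standard library. The function name is hypothetical, and whether mongod reached this error through a non-blocking connect in exactly this way is an assumption, not something the report confirms.

```python
import errno
import os
import socket


def nonblocking_connect_status(host, port):
    """Start a non-blocking TCP connect and report what the OS returned.

    Returns (code, message, in_progress). A code of 0 means the connection
    completed immediately; EINPROGRESS means the handshake is still pending.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setblocking(False)
        # connect_ex returns the errno value instead of raising an exception.
        code = sock.connect_ex((host, port))
        message = os.strerror(code) if code else "connected immediately"
        return code, message, code == errno.EINPROGRESS
    finally:
        sock.close()


if __name__ == "__main__":
    # 30001 is the primary's port assumed in the reproduction steps.
    print(nonblocking_connect_status("127.0.0.1", 30001))
```

With a DROP rule in place the handshake never completes, so a pending connect of this kind can only end by timing out, which is consistent with the "after 5000 milliseconds, giving up" warnings.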



 Comments   
Comment by Spencer Brody (Inactive) [ 20/Dec/16 ]

According to milkie's comment on SERVER-12793, it looks like this behavior has been changed since at least v3.0.

Comment by Ion Caliman [ 20/Dec/16 ]

No longer since when? Which release has the fix?

Comment by Spencer Brody (Inactive) [ 23/Nov/16 ]

We no longer consider a node up if we receive heartbeats from it but cannot heartbeat it ourselves.
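
The difference between the reported behaviour and the rule stated above can be sketched as a small model. This is a simplified illustration, not the server's implementation: the class and function names are hypothetical, and the "before" rule is an approximation inferred from the "just heartbeated us, but our heartbeat failed ... not changing state" log line in the description.

```python
from dataclasses import dataclass


@dataclass
class PeerView:
    """What one node currently knows about a peer (hypothetical model)."""
    received_heartbeat_from_peer: bool  # the peer managed to heartbeat us
    our_heartbeat_to_peer_ok: bool      # our own heartbeat to the peer succeeded


def considered_up_before(view):
    # Approximation of the reported behaviour: an inbound heartbeat is
    # enough to keep the peer's state unchanged, i.e. still "up".
    return view.our_heartbeat_to_peer_ok or view.received_heartbeat_from_peer


def considered_up_after(view):
    # Rule stated in the comment: the peer counts as up only if we can
    # heartbeat it ourselves.
    return view.our_heartbeat_to_peer_ok


if __name__ == "__main__":
    # The scenario from the description: inbound connections to the primary
    # are dropped, but the primary can still reach the secondaries.
    partitioned = PeerView(received_heartbeat_from_peer=True,
                           our_heartbeat_to_peer_ok=False)
    print("before:", considered_up_before(partitioned))
    print("after: ", considered_up_after(partitioned))
```

Under the first rule the secondaries keep treating the unreachable primary as up, so no election starts. Under the second rule they treat it as down.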

Comment by Eric Milkie [ 16/Jul/14 ]

We'll re-examine this after the election enhancements are complete.

Comment by Eric Milkie [ 09/Jun/14 ]

"Operation now in progress" is an odd error to get. I think it means that the prior attempt to connect hasn't yet cleared out. This should get smoother once we have a better story for connecting using nonblocking sockets.

Generated at Thu Feb 08 03:34:11 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.