Core Server / SERVER-14214

Leader election protocol is not resilient to message omissions

    • Type: Bug
    • Resolution: Done
    • Priority: Critical - P2
    • None
    • Affects Version/s: None
    • Component/s: Replication
    • ALL

      As a result, leader election loses its liveness property, i.e. a new master is never elected.

      A relatively easy way to trigger it:
      1) Spawn a replica set locally; replicate.py in https://github.com/dcci/mongo-replication-perf can be used for this.

      2) Once a primary is elected, drop all incoming local connections directed to it. Assuming the primary is listening on port 30001, this should be enough (on Linux, or whatever flavour of *NIX supports iptables).

      # iptables -A INPUT -j DROP -p tcp -i lo --destination-port 30001
      # iptables -A INPUT -j DROP -p tcp --destination-port 30001
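
      To confirm that the rules took effect before reading the logs, a quick connectivity probe can help. The following is a minimal sketch, not part of the original report: the helper name can_connect is hypothetical, and the 5 second default only mirrors the timeout seen in the heartbeat warnings. With the DROP rules in place the connection attempt should time out rather than be refused.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for checking the reproduction setup; it is not
    taken from replicate.py or from the server code.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers both refused connections and timeouts
        # (socket.timeout is a subclass of OSError).
        return False
```

      Calling can_connect("127.0.0.1", 30001) after step 2 should return False once the timeout expires, assuming the primary really is on port 30001.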
      

      The secondaries still receive heartbeats from the primary, so they do not change state, as the log shows:

      2014-06-09T11:45:47.792-0700 [rsHealthPoll] warning: Failed to connect to 127.0.0.1:30001 after 5000 milliseconds, giving up.
      2014-06-09T11:45:47.792-0700 [rsHealthPoll] replset info localhost:30001 heartbeat failed, retrying
      2014-06-09T11:45:50.698-0700 [rsBackgroundSync] replSet not trying to sync from localhost:30001, it is vetoed for 5 more seconds
      2014-06-09T11:45:50.698-0700 [rsBackgroundSync] replSet not trying to sync from localhost:30001, it is vetoed for 5 more seconds
      2014-06-09T11:45:52.793-0700 [rsHealthPoll] warning: Failed to connect to 127.0.0.1:30001, reason: errno:115 Operation now in progress
      2014-06-09T11:45:52.793-0700 [rsHealthPoll] replset info localhost:30001 just heartbeated us, but our heartbeat failed: , not changing state
      2014-06-09T11:45:55.698-0700 [rsBackgroundSync] replSet not trying to sync from localhost:30001, it is vetoed for 0 more seconds
      2014-06-09T11:45:55.698-0700 [rsBackgroundSync] replSet not trying to sync from localhost:30001, it is vetoed for 0 more seconds
      2014-06-09T11:45:59.069-0700 [conn46] end connection 127.0.0.1:52528 (1 connection now open)
      2014-06-09T11:45:59.069-0700 [initandlisten] connection accepted from 127.0.0.1:52545 #48 (2 connections now open)
      2014-06-09T11:45:59.835-0700 [rsHealthPoll] warning: Failed to connect to 127.0.0.1:30001 after 5000 milliseconds, giving up.
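
      The "just heartbeated us, but our heartbeat failed: , not changing state" line suggests why no election happens. The sketch below is a toy model of that behaviour under a stated assumption: a secondary treats the primary as down only when heartbeats are silent in both directions. The rule and all names (SecondaryView, considers_primary_down, election_starts) are hypothetical illustrations inferred from the log, not the server's actual implementation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SecondaryView:
    """What one secondary observes about the primary (toy model)."""
    outgoing_heartbeat_ok: bool   # our heartbeat request to the primary succeeded
    primary_heartbeated_us: bool  # the primary recently sent us a heartbeat


def considers_primary_down(view: SecondaryView) -> bool:
    # Assumed rule, inferred from the "not changing state" log line:
    # the primary counts as down only if both directions are silent.
    return not view.outgoing_heartbeat_ok and not view.primary_heartbeated_us


def election_starts(views: Iterable[SecondaryView]) -> bool:
    # Simplification: any secondary that sees the primary as down
    # would try to start an election.
    return any(considers_primary_down(v) for v in views)


# One-way omission, as produced by the iptables rules: requests to the
# primary are dropped, but the primary can still reach the secondaries.
one_way = [SecondaryView(False, True), SecondaryView(False, True)]

# Full partition: nothing gets through in either direction.
full = [SecondaryView(False, False), SecondaryView(False, False)]
```

      In this model election_starts(one_way) is False while election_starts(full) is True, which matches the reported symptom: the primary cannot be reached, yet no secondary ever moves to replace it.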
      

            Assignee: backlog-server-repl ([DO NOT USE] Backlog - Replication Team)
            Reporter: davide.italiano (Davide Italiano)
            Votes: 0
            Watchers: 9
