Core Server / SERVER-8084

Replica set may be unavailable for up to 30 seconds after killing primary

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 2.2.2
    • Component/s: Replication
    • ALL

      You can observe this behavior by using the Ruby driver's replica set test suite. Invoke the test suite with `rake test:replica_set` and then `tail -f data//.log` once the replica set logfiles have been created. If the primary is killed immediately after a successful election, the replica set is unavailable for about 30 seconds, because none of the remaining nodes can cast a vote while they still hold their vote leases.

      This was discovered in the course of testing the Ruby driver. The related issue is here: https://jira.mongodb.org/browse/RUBY-523

      consensus.cpp has a hardcoded 30-second "lease" that starts when a replica set member casts a vote and prevents it from casting another vote until the lease expires. This means that if you spin up a replica set, allow a primary to be elected, and then kill the primary, the entire replica set is unavailable for up to 30 seconds: the remaining nodes have to wait out the lease before they can cast another vote to elect a new primary to replace the killed one.
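
      To make the timing concrete, here is a minimal Python sketch of the lease rule as described above. It is a simplified model and not the code in consensus.cpp: the `Voter` class, its fields, and `request_vote` are hypothetical names, and time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_SECONDS = 30.0  # the hardcoded lease length described in this report


@dataclass
class Voter:
    """Hypothetical model of one replica set member's vote lease."""
    name: str
    voted_for: Optional[str] = None
    voted_at: Optional[float] = None

    def lease_held(self, now: float) -> bool:
        # A lease is held from the moment a vote is cast until 30 seconds later.
        return self.voted_at is not None and now - self.voted_at < LEASE_SECONDS

    def request_vote(self, candidate: str, now: float) -> bool:
        """Grant a vote unless an earlier vote's lease is still running."""
        if self.lease_held(now):
            return False
        self.voted_for = candidate
        self.voted_at = now
        return True


if __name__ == "__main__":
    b, c = Voter("B"), Voter("C")
    # t=0: B and C both vote for A, and A becomes primary.
    b.request_vote("A", now=0.0)
    c.request_vote("A", now=0.0)
    # t=2: A is killed and B asks C for a vote, but C still holds its lease.
    print(c.request_vote("B", now=2.0))   # False
    # t=30: the lease has expired, so C can finally vote again.
    print(c.request_vote("B", now=30.0))  # True
```

      In this model the set stays without a primary from t=2 until t=30, which matches the "up to 30 seconds" window: the closer the kill is to the election, the longer the wait.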

      I think that once an election has succeeded, nodes should clear their lease timers, so that they are immediately available for another election. Additionally (or alternatively), if a node holds a vote lease, it should check that the node it voted for is still part of the cluster before refusing to recast its vote. If the voted-for member has disappeared, the node should cast a new vote.
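
      Both suggestions can be sketched in a few lines of Python. This is only an illustration of the proposal, not a patch: `LeasedVoter`, `clear_lease`, and the `live_members` argument are hypothetical, and how a node would actually learn which members are reachable is left out.

```python
from dataclasses import dataclass
from typing import Optional, Set

LEASE_SECONDS = 30.0  # the hardcoded lease length described in this report


@dataclass
class LeasedVoter:
    """Hypothetical voter that implements both proposed changes."""
    name: str
    voted_for: Optional[str] = None
    voted_at: Optional[float] = None

    def clear_lease(self) -> None:
        """Proposal 1: called once an election has succeeded."""
        self.voted_for = None
        self.voted_at = None

    def request_vote(self, candidate: str, now: float,
                     live_members: Set[str]) -> bool:
        lease_running = (self.voted_at is not None
                         and now - self.voted_at < LEASE_SECONDS)
        # Proposal 2: the lease only binds while the member we voted for
        # is still part of the cluster.
        if lease_running and self.voted_for in live_members:
            return False
        self.voted_for = candidate
        self.voted_at = now
        return True


if __name__ == "__main__":
    c = LeasedVoter("C")
    c.request_vote("A", now=0.0, live_members={"A", "B", "C"})
    # A is still up: the lease is honoured.
    print(c.request_vote("B", now=2.0, live_members={"A", "B", "C"}))  # False
    # A has disappeared: C recasts its vote without waiting out the lease.
    print(c.request_vote("B", now=2.0, live_members={"B", "C"}))       # True
```

      Either change alone removes the wait in the kill-after-election case. Clearing the lease is simpler, while the liveness check keeps the lease meaningful during an election that is still in progress.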

            Assignee: Unassigned
            Reporter: Chris Heald (cheald)
            Votes: 1
            Watchers: 5