[SERVER-8084] Replica set may be unavailable for up to 30 seconds after killing primary
Created: 04/Jan/13  Updated: 10/Dec/14  Resolved: 05/Mar/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.2.2
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Chris Heald
Assignee: Unassigned
Resolution: Duplicate
Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate: duplicates SERVER-10225, "Replica set failover speed improvement" (Closed)
Related: is related to RUBY-523, "Runtime improvements for replica set ..." (Closed)
Operating System: ALL
Steps To Reproduce:

You can observe this behavior using the Ruby driver's replica set test suite. Run `rake test:replica_set`, then `tail -f data//.log` once the replica set log files have been created. When the primary is killed immediately after a successful election, the replica set is unavailable for ~30 seconds: none of the remaining nodes can cast a vote until the lease from its previous vote expires.

Participants:

 Description   

This was discovered in the course of testing the Ruby driver. Issue here: https://jira.mongodb.org/browse/RUBY-523

consensus.cpp has a hardcoded 30-second "lease" that starts when a replica set member casts a vote and prevents it from casting another vote until the lease expires. This means that if you spin up a replica set, allow a primary to be elected, and then kill the primary, the entire replica set is unavailable for up to 30 seconds: the remaining nodes have to wait out their leases before they can vote to elect a new primary to replace the killed one.
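
To make the timing concrete, here is a toy model of the lease as I understand it. This is only a sketch, not the actual consensus.cpp logic: the `Voter` class and its method names are hypothetical, and only the 30-second figure comes from the server code.

```python
LEASE_SECONDS = 30  # hardcoded lease duration described in this issue


class Voter:
    """Toy model of one replica set member's vote lease (hypothetical names)."""

    def __init__(self):
        self.lease_started_at = None  # time of the last vote cast, if any

    def seconds_until_vote(self, now):
        """How long this member must still wait before it may vote again."""
        if self.lease_started_at is None:
            return 0.0
        return max(0.0, LEASE_SECONDS - (now - self.lease_started_at))

    def cast_vote(self, now):
        """Cast a vote if no lease is held; return whether the vote was cast."""
        if self.seconds_until_vote(now) > 0:
            return False  # still inside the lease window from the previous vote
        self.lease_started_at = now
        return True


voter = Voter()
voter.cast_vote(now=0.0)                  # votes in the first election
blocked = not voter.cast_vote(now=5.0)    # primary killed at t=5s: vote refused
wait = voter.seconds_until_vote(now=5.0)  # time left before a new vote is allowed
```

In this model, a primary killed 5 seconds after the election leaves each voter unable to vote for another 25 seconds, which matches the "up to 30 seconds" window seen in the logs.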

I think that once an election has succeeded, nodes should clear their lease timers so that they are immediately available for another election. Additionally (or alternatively), a node that holds a vote lease should check that the node it voted for is still part of the cluster before refusing to recast its vote. If the voted-for member has disappeared, the node should cast a new vote.
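
A rough sketch of both suggestions, again as a toy model rather than a patch against consensus.cpp. The class, the method names, and the `live_members` set are all hypothetical; how a real node would learn that the election succeeded or that a member has disappeared is assumed, not shown.

```python
LEASE_SECONDS = 30  # hardcoded lease duration described in this issue


class LeasedVoter:
    """Toy model of a member applying the two proposed changes (hypothetical)."""

    def __init__(self):
        self.voted_for = None
        self.voted_at = None

    def _lease_blocks_vote(self, now, live_members):
        if self.voted_at is None:
            return False  # never voted, or lease was cleared
        if now - self.voted_at >= LEASE_SECONDS:
            return False  # lease expired on its own
        # Suggestion 2: a lease for a member that has left the cluster
        # should not prevent a new vote.
        if self.voted_for not in live_members:
            return False
        return True

    def cast_vote(self, candidate, now, live_members):
        """Vote for candidate unless a still-valid lease forbids it."""
        if self._lease_blocks_vote(now, live_members):
            return False
        self.voted_for = candidate
        self.voted_at = now
        return True

    def on_election_succeeded(self):
        # Suggestion 1: once the election has completed, drop the lease so
        # the member is immediately available for another election.
        self.voted_for = None
        self.voted_at = None


members = {"A", "B", "C"}
node = LeasedVoter()
node.cast_vote("A", now=0.0, live_members=members)                   # votes for A
still_blocked = not node.cast_vote("B", now=5.0, live_members=members)
recast = node.cast_vote("B", now=5.0, live_members={"B", "C"})       # A is gone
```

With either change, the member in this model can vote again 5 seconds after the first election instead of waiting out the remaining 25 seconds. The sketch says nothing about whether these changes are safe in the real election protocol.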

