[SERVER-11086] Election handoff to new primary, during stepdown
Created: 08/Oct/13  Updated: 01/Nov/18  Resolved: 01/Nov/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.6.7, 4.0.2

Type: Improvement
Priority: Major - P3
Reporter: Charlie Page
Assignee: Alyson Cabral (Inactive)
Resolution: Done
Votes: 20
Labels: elections
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Documented
is documented by DOCS-12181 Docs for SERVER-11086: Election hando... Closed
Duplicate
is duplicated by SERVER-22050 Primary vote for stepdown is unreason... Closed
Related
related to SERVER-22502 Replication Protocol 1 rollbacks are ... Closed
related to SERVER-32906 Improve logging around elections Closed
is related to SERVER-10225 Replica set failover speed improvement Closed
is related to SERVER-18453 Avoiding Rollbacks in new Raft based ... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 12 (04/01/16)
Participants:

 Description   

When a step-down command is issued, a fast election mode could be implemented to allow a seamless transition to a new primary without having to wait for a full election. The goal is to reduce the time without a primary.



 Comments   
Comment by Alyson Cabral (Inactive) [ 01/Nov/18 ]

Often the longest part of an election is the time spent determining that the primary is down. To avoid spurious elections, we let users specify how long to wait to confirm that a node is down and unresponsive before calling an election. Otherwise, a simple network blip could trigger an election even when there is a perfectly healthy primary.

When stepdown is called explicitly on the primary, we have built in an optimization that skips the wait to confirm that a node is down and immediately tells the appropriate secondary to call an election. This project, Election Handoff, minimizes write downtime during planned maintenance of a cluster. During scheduled maintenance windows, such as upgrading a cluster or performing a rolling index build, it is often necessary to step down the primary. We've seen a 10x improvement in stepdown times using the default election configuration.

The Election Handoff behavior was added in 3.6.7 and 4.0.2, and further optimized in 3.6.9 and 4.0.3.

Comment by Oleg Rekutin [ 30/Jan/18 ]

Dmitry, we ended up writing JavaScript code for "planned stepdowns" that runs in the Mongo Shell and: 1) lowers the election timeout from 10 seconds to 2 seconds, 2) performs the step-down, 3) connects to the new primary and raises the election timeout back to 10 seconds.
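
For illustration, the three steps described above might look roughly like the sketch below. This is not the script referred to in this comment. It is a minimal sketch that assumes an object shaped like the shell's `rs` helper (`conf()`, `reconfig()`, `stepDown()`) and a hypothetical `connectToNewPrimary` callback, supplied by the caller, that returns the same kind of helper bound to the new primary. The 2000 ms and 10000 ms defaults mirror the timeouts mentioned above; the 60-second step-down period is an assumption.

```javascript
// Sketch only: names other than conf/reconfig/stepDown and
// settings.electionTimeoutMillis are hypothetical.

// Read the replica set config, change the election timeout, and apply it.
function setElectionTimeout(rsHelper, millis) {
  var cfg = rsHelper.conf();
  cfg.settings = cfg.settings || {};
  cfg.settings.electionTimeoutMillis = millis;
  rsHelper.reconfig(cfg);
}

// rsHelper: object shaped like the shell's `rs` helper, bound to the current primary.
// connectToNewPrimary: hypothetical callback returning a similar helper
//                      bound to whichever member becomes primary.
function plannedStepDown(rsHelper, connectToNewPrimary, opts) {
  opts = opts || {};
  var fastMs = opts.fastMs || 2000;            // temporary, shorter timeout
  var normalMs = opts.normalMs || 10000;       // timeout to restore afterwards
  var stepDownSecs = opts.stepDownSecs || 60;  // assumed step-down period

  // 1) Lower the election timeout on the current primary.
  setElectionTimeout(rsHelper, fastMs);

  // 2) Step down. Older servers may close client connections here, which
  //    can surface as an exception; this sketch assumes any error at this
  //    point is of that kind. A real script should inspect the error.
  try {
    rsHelper.stepDown(stepDownSecs);
  } catch (e) {
    // Assumed to be the expected disconnect; ignored in this sketch.
  }

  // 3) Connect to the new primary and restore the original timeout.
  var newRsHelper = connectToNewPrimary();
  setElectionTimeout(newRsHelper, normalMs);
}
```

In the shell, the built-in `rs` object could be passed as the first argument; how the callback locates and connects to the new primary is left to the caller.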

I still would love for Mongo to support a fast maintenance step-down that's in milliseconds and not seconds.

Comment by Dmitry Mikhaylov [ 08/Jan/18 ]

Seeing this issue stagnate for more than 4 years now makes me wonder: how many people using MongoDB for mission-critical purposes are actually doing maintenance? We do, and each maintenance becomes a real problem, as it means a very noticeable disruption in service for our users. With the default timeout of 10 seconds and our load, the real time without a primary is ~13 seconds from the application's point of view.

What makes this especially painful is that, according to the logs, there are 10 seconds during which all members of the replica set are in SECONDARY state and know that all the others are in SECONDARY state. In effect, the whole replica set knows there will be no primary until an election is held, and yet no one dares to start the election until the timeout expires.

I understand that having special re-election logic in step-down is tricky. But this issue can be fixed with a smaller (or no) re-election timeout for the situation when all instances are SECONDARY. Maybe this solution is easier and less risky?

Comment by Hoyt Ren [ 08/Jan/16 ]

Yes, when you do maintenance, you want to minimize the interruption. We measure the duration in milliseconds, not seconds, so this feature is needed.
