[SERVER-36387] Allow heartbeat responses to wake ready waiters even when they do not advance optimes Created: 01/Aug/18  Updated: 25/Nov/18  Resolved: 02/Aug/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Vesselina Ratcheva (Inactive) Assignee: Vesselina Ratcheva (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
Backport Requested:
v4.0, v3.6
Participants:

 Description   

Heartbeats and replSetUpdatePosition commands can only wake up replication waiters if they carry optime changes. If no further progress can be made (e.g. when the node in question has fully caught up), those waiters will not be signaled unless new writes come in. This is not necessarily an issue for functionality like awaitReplication, but it can be a problem for stepdown. For example, during a stepdown attempt, secondaries can catch up while they are frozen; when the freeze is then lifted, nothing signals the waiters (since everyone is already up to date), and the attempt times out. This can be fixed by allowing heartbeat responses that do not advance optimes to still wake up replication waiters (by doing the minimal amount of work required for that).

This bug was introduced by the changes in SERVER-35058 (specifically here).
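
The stepdown scenario described above can be illustrated with a toy model. This is only a sketch and not the server's actual code: `ToyCoordinator`, `Waiter`, `process_heartbeat`, and the `wake_without_progress` flag are hypothetical names, and it assumes for simplicity that a heartbeat carries both the member's optime and whether it is frozen.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Waiter:
    """A pending request that is satisfied once `condition` holds."""
    name: str
    condition: Callable[["ToyCoordinator"], bool]
    done: bool = False


@dataclass
class ToyCoordinator:
    """Hypothetical stand-in for a replication coordinator (not MongoDB's API)."""
    optimes: Dict[str, int] = field(default_factory=dict)  # member -> last known optime
    frozen: Dict[str, bool] = field(default_factory=dict)  # member -> currently frozen
    waiters: List[Waiter] = field(default_factory=list)
    wake_without_progress: bool = False  # the behavior proposed in this ticket

    def _wake_ready_waiters(self) -> None:
        # Re-evaluate every pending waiter; this is the cost being weighed.
        for waiter in self.waiters:
            if not waiter.done and waiter.condition(self):
                waiter.done = True

    def process_heartbeat(self, member: str, optime: int, frozen: bool) -> None:
        previous = self.optimes.get(member, -1)
        advanced = optime > previous
        self.optimes[member] = max(optime, previous)
        self.frozen[member] = frozen
        # Waiters are only signaled on optime progress, unless the
        # proposed behavior is switched on.
        if advanced or self.wake_without_progress:
            self._wake_ready_waiters()


def stepdown_condition(primary_optime: int) -> Callable[[ToyCoordinator], bool]:
    """Stepdown can proceed once some member is caught up and not frozen."""
    def condition(coord: ToyCoordinator) -> bool:
        return any(
            coord.optimes[m] >= primary_optime and not coord.frozen[m]
            for m in coord.optimes
        )
    return condition


def run_scenario(wake_without_progress: bool) -> bool:
    coord = ToyCoordinator(wake_without_progress=wake_without_progress)
    waiter = Waiter("stepdown", stepdown_condition(primary_optime=10))
    coord.waiters.append(waiter)
    # The secondary catches up while frozen: waiters are checked, but not ready.
    coord.process_heartbeat("secondary", optime=10, frozen=True)
    # The freeze is lifted, but the optime does not change.
    coord.process_heartbeat("secondary", optime=10, frozen=False)
    return waiter.done
```

In this model, `run_scenario(False)` leaves the stepdown waiter unsignaled even though its condition now holds, which corresponds to the timeout described above, while `run_scenario(True)` signals it at the price of re-checking all waiters on every heartbeat.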



 Comments   
Comment by Vesselina Ratcheva (Inactive) [ 02/Aug/18 ]

A little bit of both, actually. Before SERVER-35058, the stepdown waiters were separate, so it was relatively cheap to wake them up, which the old behavior did. Now that those are unified, signaling would be more expensive, and we ultimately decided this was not worth the performance impact, since in PV1 it can only happen with deliberate manual intervention (in the form of freezing).

What I described in the ticket would be a big problem in PV0, since it has "VotedTooRecently" as a reason for a node to not be electable. With that, you could easily run into a situation where the vote lease expires after everyone has caught up, leaving you with nothing to signal the waiters. We decided the solution is to simply skip SERVER-35058 when backporting election handoff to v3.6, since it is not strictly required for it to work.

Closing this as "Won't Fix".

Comment by Judah Schvimer [ 01/Aug/18 ]

Will this introduce any performance regressions around waking up waiters unnecessarily? Or is this returning to an old behavior?
