[SERVER-32794] Make timeouts unrelated to elections not depend on election timeout Created: 19/Jan/18 Updated: 30/Oct/23 Resolved: 22/Jan/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.3, 3.7.2 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Judah Schvimer | Assignee: | Judah Schvimer |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v3.6, v3.4 |
| Sprint: | Repl 2018-01-29 |
| Participants: |
| Description |
|
For testing, it can be helpful to increase the election timeout to infinity. Several timeouts are calculated from the election timeout, which prevents this. We should add a maximum to these timeouts, potentially based on the heartbeat interval like in the TopologyCoordinator below: |
| Comments |
| Comment by Githook User [ 26/Jan/18 ] |
|
Author: Judah Schvimer (judahschvimer, judah@mongodb.com) Message: (cherry picked from commit f3b504948c0cef40deffb4786ebdda6797625142) |
| Comment by Githook User [ 22/Jan/18 ] |
|
Author: Judah Schvimer (judahschvimer, judah@mongodb.com) Message: |
| Comment by Spencer Brody (Inactive) [ 22/Jan/18 ] |
|
schwerin, for #21, it probably isn't strictly necessary to put a cap on the upper bound of this value, but then again that's probably true for all of these. My thought was just that since this is the primary channel for conveying liveness information through the set it makes sense to keep that channel somewhat active. The current plan is to put the upper bound at 1 minute, which should already be far higher than anyone is likely to use in practice. |
| Comment by Judah Schvimer [ 22/Jan/18 ] |
|
siyuan.zhou, #21 is sync source feedback, not sync source resolver. Sync source feedback, i.e. replSetUpdatePosition, seems very related to primaries stepping down. |
| Comment by Siyuan Zhou [ 19/Jan/18 ] |
|
Agree with Spencer on "heartbeatTimeoutPeriod". I think topology detection should be separated from the decision whether the primary should step down. For #21, should the sync source resolver rely on the heartbeat parameters instead of the election timeout? It seems to be a separate issue from consensus. |
| Comment by Andy Schwerin [ 19/Jan/18 ] |
|
For #21, why do we care about liveness in this scenario? |
| Comment by Judah Schvimer [ 19/Jan/18 ] |
|
Oh, I think I was looking at PV0 code. |
| Comment by Spencer Brody (Inactive) [ 19/Jan/18 ] |
|
Hmm... I think it's fine to mark the node as down if you haven't heard from it in the heartbeat timeout. The problem is automatically stepping down when all nodes are down. We should only step down when a majority of nodes have been down for the election timeout. |
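The distinction drawn in this comment can be sketched roughly as follows. This is an illustration only, not the server's code: `MemberState`, `is_marked_down`, and `should_step_down` are hypothetical names, times are plain seconds, and "a majority of nodes have been down" is interpreted here as "the primary cannot reach a majority of the set, counting itself as up".

```python
import math
from dataclasses import dataclass


@dataclass
class MemberState:
    # Hypothetical record: time (in seconds) of the last heartbeat response.
    last_heartbeat_s: float


def is_marked_down(member: MemberState, now_s: float, heartbeat_timeout_s: float) -> bool:
    """Topology detection: a member is marked down after the heartbeat timeout."""
    return now_s - member.last_heartbeat_s > heartbeat_timeout_s


def should_step_down(others: list, now_s: float, election_timeout_s: float) -> bool:
    """Step-down decision: uses the election timeout, not the heartbeat timeout.

    `others` holds the other voting members; the primary counts itself as up.
    """
    total = len(others) + 1
    unreachable = sum(
        1 for m in others if now_s - m.last_heartbeat_s > election_timeout_s
    )
    reachable = total - unreachable
    # Assumption: step down once a strict majority is no longer reachable.
    return reachable <= total // 2


# Three-node set in which neither secondary has responded for 15 seconds.
others = [MemberState(0.0), MemberState(0.0)]
marked_down = [is_marked_down(m, 15.0, 10.0) for m in others]
steps_down_default = should_step_down(others, 15.0, 10.0)
steps_down_infinite = should_step_down(others, 15.0, math.inf)
```

With an infinite election timeout the members are still marked down for topology purposes, but the primary never steps down on its own, which is the behavior wanted for testing.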
| Comment by Judah Schvimer [ 19/Jan/18 ] |
|
Per conversation, we'll make the maximum 30 seconds. Another problem will be heartbeat step downs. If we don't see heartbeats within the "heartbeatTimeoutPeriod" (not the heartbeat interval), then we'll mark a node as down. If we receive a heartbeat and a majority of nodes are down, then we'll step down. This timeout is also 10 seconds by default. I think the only problem is here: I think this should maybe be the election timeout instead of the heartbeat timeout? spencer siyuan.zhou |
| Comment by Spencer Brody (Inactive) [ 19/Jan/18 ] |
|
I think we probably also want to update #21, since it's used for liveness. I think both #21 and #1 could be set to the min of half the election timeout, or twice the heartbeat timeout. I think we probably also want to put a cap on #26, maybe a minute. |
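The rule suggested in this comment can be written out as a small sketch. The function names and the use of float seconds are assumptions made for illustration; they do not correspond to identifiers in the server, and the one-minute cap is only the value floated in the comment.

```python
import math


def liveness_timeout(election_timeout_s: float, heartbeat_timeout_s: float) -> float:
    """Min of half the election timeout and twice the heartbeat timeout."""
    return min(election_timeout_s / 2, 2 * heartbeat_timeout_s)


def cap_timeout(derived_timeout_s: float, cap_s: float = 60.0) -> float:
    """Bound a timeout derived from the election timeout by a fixed cap."""
    return min(derived_timeout_s, cap_s)


# Both timeouts at 10 seconds: half the election timeout wins.
default_case = liveness_timeout(10.0, 10.0)        # 5.0
# Infinite election timeout (testing): the heartbeat-based bound wins.
infinite_case = liveness_timeout(math.inf, 10.0)   # 20.0
# A value derived directly from an infinite election timeout is capped.
capped_case = cap_timeout(math.inf)                # 60.0
```

Taking the minimum keeps the existing behavior for ordinary election timeouts while guaranteeing a finite value when the election timeout is set to infinity.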
| Comment by Judah Schvimer [ 19/Jan/18 ] |
|
A grep finds 26 occurrences of "getElectionTimeoutPeriod()". Looking through all case-insensitive occurrences of "electiontimeout", I don't see any others that should be a problem: |