[SERVER-38496] A secondary should be able to determine if it's healthy (to avoid unwanted election calls) Created: 10/Dec/18 Updated: 17/May/23 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Dmitry Ryabtsev | Assignee: | Backlog - Replication Team |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: |
Replication
|
||||
| Sprint: | Repl 2019-01-28 | ||||
| Description |
|
A pv1 replica set member can call for an election because of its own performance degradation. If the other members happen to be lagging, the affected member can win that election, which is not a desirable outcome. It would be great if a secondary could determine that it is itself unhealthy and that calling for an election is therefore not justified. |
| Comments |
| Comment by Judah Schvimer [ 10/Dec/18 ] |
|
dmitry.ryabtsev and I discussed the case where node A runs for election while node B thinks node A is unhealthy: it would be problematic for node B to vote 'no' in node A's election. To my understanding, this ticket is specifically about node A choosing not to run for election when it knows itself to be unhealthy. This could lead to no node running for election in certain cases where an unhealthy node would currently get elected, or to slower elections in cases where an unhealthy node could get elected sooner than a healthy one. A slower election of a healthy node is probably better than a faster election of a worse primary. Having no primary at all is definitely a problem though. If the "unhealthiness" is due to networking problems, this could be bad for w:1 availability in certain cases, but it doesn't seem to hurt w:majority availability, since a primary that can't do networking is no better than no primary at all for w:majority writes. The w:1 availability problem could be addressed by having a "networking unhealthy" node increase its election timeout rather than not run for election at all. If the "unhealthiness" is due to storage or CPU problems, then w:1 availability is likely to be a problem even if the node becomes primary, so no primary is maybe not worse than a primary that can't write to disk. |
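
The policy discussed in the comment above could be sketched roughly as follows. This is only an illustration of the idea, not server code: the `Health` categories, the `decide_election` function, and the `network_backoff_factor` parameter are all hypothetical names, and how a node would actually detect that it is unhealthy is left open. The 10000 ms base mirrors the default value of `settings.electionTimeoutMillis`.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Health(Enum):
    """Hypothetical self-assessed health states for a secondary."""
    HEALTHY = auto()
    NETWORK_DEGRADED = auto()
    STORAGE_OR_CPU_DEGRADED = auto()


@dataclass(frozen=True)
class ElectionDecision:
    run_for_election: bool
    timeout_ms: Optional[int]  # None when the node abstains entirely


def decide_election(health: Health,
                    base_timeout_ms: int = 10_000,
                    network_backoff_factor: int = 3) -> ElectionDecision:
    """Decide whether this node should stand for election, and how long to wait.

    Assumption: the backoff factor is an arbitrary illustrative value.
    """
    if health is Health.HEALTHY:
        # Normal behaviour: run after the usual election timeout.
        return ElectionDecision(True, base_timeout_ms)
    if health is Health.NETWORK_DEGRADED:
        # Still eligible, so a w:1 primary can eventually exist, but wait
        # longer to give healthier members a chance to run first.
        return ElectionDecision(True, base_timeout_ms * network_backoff_factor)
    # Storage or CPU trouble: a primary that cannot write to disk is
    # arguably no better than no primary, so abstain.
    return ElectionDecision(False, None)


if __name__ == "__main__":
    for state in Health:
        print(state.name, decide_election(state))
```

Lengthening the timeout for the networking case, instead of abstaining, keeps the set from ending up with no candidate at all, which is the failure mode the comment calls out as definitely a problem.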