[SERVER-32237] Nodes that cannot become primary must neither update progress nor vote "aye" Created: 08/Dec/17 Updated: 06/Dec/22 Resolved: 05/Feb/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Judah Schvimer | Assignee: | Backlog - Replication Team |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Assigned Teams: | Replication |
| Operating System: | ALL |
| Participants: |
| Description |
|
Consider a 3-node replica set with a primary, a secondary, and a voting-unelectable node (in rollback, initial sync, or recovering), where all nodes are replicating from the primary. The primary takes writes at times T1, T2, and T3 with w:majority. The secondary replicates the write at T1, and the voting-unelectable node replicates the writes at T1 and T2. The primary will see that T1 and T2 are both replicated to a majority, so it will commit them and acknowledge them to the client.

Now consider what happens if the primary crashes. The secondary is behind the voting-unelectable node, so the voting-unelectable node won't vote for it (and must not, because then we'd lose the majority-committed write at T2), but the voting-unelectable node itself cannot be elected. We will thus not be able to elect a primary. If the unelectable node is also inconsistent, this is even worse, because there is no way to make it electable.

Thus we should not update our progress if we're unelectable. The node should not vote "aye" either. While voting "aye" will not cause us to lose committed writes (assuming we do not update progress as above), it will cause the unelectable node to vote for nodes that cannot commit writes, since the unelectable node cannot be part of a majority to help commit writes. |
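The scenario above can be sketched as a toy model. This is a minimal illustration, not the server's actual logic: the `Node` class, `majority_commit_point`, and `can_elect_primary` are hypothetical names, optimes are simplified to the integers 1 to 3 for T1 to T3, and a voter is assumed to grant its vote only to a candidate that is at least as up to date as itself.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_applied: int   # newest replicated write: 1 = T1, 2 = T2, 3 = T3
    electable: bool     # False while in rollback, initial sync, or recovering
    alive: bool = True


def majority_commit_point(nodes, count_unelectable):
    """Newest write that a majority of voting nodes report having replicated.

    If count_unelectable is False, unelectable nodes report no progress (0).
    """
    reported = sorted(
        (n.last_applied if (n.electable or count_unelectable) else 0 for n in nodes),
        reverse=True,
    )
    majority = len(nodes) // 2 + 1
    return reported[majority - 1]


def can_elect_primary(nodes):
    """True if some live, electable node can gather a majority of votes."""
    majority = len(nodes) // 2 + 1
    for candidate in nodes:
        if not (candidate.alive and candidate.electable):
            continue
        # Simplified voting rule: a live voter says "aye" only if the
        # candidate is not behind it.
        votes = sum(
            1 for voter in nodes
            if voter.alive and voter.last_applied <= candidate.last_applied
        )
        if votes >= majority:
            return True
    return False


primary = Node("primary", last_applied=3, electable=True)
secondary = Node("secondary", last_applied=1, electable=True)
unelectable = Node("unelectable", last_applied=2, electable=False)
nodes = [primary, secondary, unelectable]

# Behavior described in the ticket: the unelectable node's progress counts.
commit_today = majority_commit_point(nodes, count_unelectable=True)
# Proposed behavior: the unelectable node does not update its progress.
commit_proposed = majority_commit_point(nodes, count_unelectable=False)

primary.alive = False
stuck = not can_elect_primary(nodes)
```

In this model `commit_today` is 2 (T2 is acknowledged), `commit_proposed` is 1 (only T1 is acknowledged), and `stuck` is true after the primary crashes, because the secondary cannot win the unelectable node's vote.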
| Comments |
| Comment by Judah Schvimer [ 05/Feb/20 ] |
|
This ticket and |
| Comment by Judah Schvimer [ 03/Jan/18 ] |
If you're okay with voting for a primary that cannot commit majority writes, then I think it is fine to keep voting. Users may find this behavior surprising and it could lead to longer rollbacks. It could also lead to a primary being elected that cannot commit majority writes even if another node exists that could immediately commit majority writes if it were elected. |
| Comment by Spencer Brody (Inactive) [ 02/Jan/18 ] |
|
judah.schvimer, thinking about this further, do we actually need to not vote "aye" or only to not report progress? If we stop reporting progress then we don't need to worry about incorrectly satisfying a w:majority write, but if we keep voting (initial sync could consider all other nodes ahead of us, rollback could vote with the last common point) then we don't risk reducing write availability unnecessarily. |
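The alternative floated in this comment can be sketched as follows. This is only an illustration under assumptions: `optime_for_voting` and `grants_vote` are hypothetical helpers, optimes are plain integers, and the state names are used as simple strings rather than the server's real member-state type.

```python
def optime_for_voting(state, last_applied, common_point=None):
    """Optime a node compares against a candidate's when deciding its vote."""
    if state == "STARTUP2":
        # Initial sync: treat every other node as ahead of us.
        return 0
    if state == "ROLLBACK" and common_point is not None:
        # Rollback: vote as if we were at the last common point.
        return common_point
    return last_applied


def grants_vote(state, last_applied, candidate_optime, common_point=None):
    """Vote "aye" only for a candidate that is not behind our voting optime."""
    return candidate_optime >= optime_for_voting(state, last_applied, common_point)


# A node in rollback that applied T2 but shares only T1 with its sync source
# would vote for a candidate at T1; a normal secondary at T2 would not.
rollback_vote = grants_vote("ROLLBACK", last_applied=2, candidate_optime=1, common_point=1)
secondary_vote = grants_vote("SECONDARY", last_applied=2, candidate_optime=1)
```

As the comment notes, this is only safe if the node has also stopped reporting progress, so that writes past its voting optime were never counted toward a majority.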
| Comment by Eric Milkie [ 09/Dec/17 ] |
|
Also, I’m not sure users will expect that their commit level may stop moving after setting maintenance mode, if we make it stop reporting position. |
| Comment by Eric Milkie [ 09/Dec/17 ] |
|
Arbiters are also “nodes that cannot become primary”. I don’t think you can prohibit them from voting “aye”. |
| Comment by Judah Schvimer [ 08/Dec/17 ] |
|
On second thought, we'll also have to make sure that the reporter does not send our updated optime in its liveness updates. |
| Comment by Judah Schvimer [ 08/Dec/17 ] |
|
This would also still allow priority 0 nodes to forward their progress, but that's fine since users can always reconfig those nodes to be electable if needed. Maintenance Mode will not allow nodes to forward their progress, which is probably what we want anyway. |
| Comment by Judah Schvimer [ 08/Dec/17 ] |
|
We can probably just add a check that we're in SECONDARY here. The only concern is the case where we do the check, immediately become SECONDARY, and then never replicate another operation: we need to make sure we still update our sync source. Based on my reading of the Reporter, it sends progress periodically even without an update (presumably for liveness updates): https://github.com/mongodb/mongo/blob/2680f414b5fd303b93e48ff5a49fdf04535f05ec/src/mongo/db/repl/reporter.cpp#L293-L302 |
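The check discussed in these comments might look roughly like the sketch below. It is not the real Reporter: `ProgressReporter`, `on_new_operation`, and `on_keep_alive` are hypothetical names, and the model only assumes that progress is sent both when an operation is applied and on a periodic keep-alive. Outside SECONDARY the keep-alive resends the last reported optime rather than the newest one.

```python
class ProgressReporter:
    """Toy model of a reporter that forwards progress only while SECONDARY."""

    def __init__(self):
        self.last_reported = 0   # newest optime already sent upstream
        self.sent = []           # every optime sent, in order

    def _send(self, optime):
        self.last_reported = optime
        self.sent.append(optime)

    def on_new_operation(self, state, last_applied):
        # Forward new progress only from an electable (SECONDARY) state.
        if state == "SECONDARY":
            self._send(last_applied)

    def on_keep_alive(self, state, last_applied):
        # Liveness update: always sent, but the optime does not advance
        # unless we are SECONDARY.
        if state == "SECONDARY":
            self._send(last_applied)
        else:
            self._send(self.last_reported)


reporter = ProgressReporter()
reporter.on_new_operation("SECONDARY", 1)  # T1 is reported
reporter.on_new_operation("ROLLBACK", 2)   # T2 is applied but not reported
reporter.on_keep_alive("ROLLBACK", 2)      # liveness update still says T1
reporter.on_keep_alive("SECONDARY", 2)     # back in SECONDARY: T2 is reported
```

Because the keep-alive path re-evaluates the state, a node that becomes SECONDARY without replicating another operation still reports its newest optime on the next periodic update.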