[SERVER-10375] DNS failures can cause a primary-less state that wouldn't exist if a node had gone down entirely Created: 30/Jul/13 Updated: 14/Apr/16 Resolved: 15/Oct/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matt Dannenberg | Assignee: | Eric Milkie |
| Resolution: | Done | Votes: | 8 |
| Labels: | elections |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Operating System: | ALL |
| Steps To Reproduce: | 1. Build a cluster of three nodes between two servers (SECONDARY and ARBITER on one server, PRIMARY on the other). 2. Remove the entries for the SECONDARY and ARBITER hosts from the PRIMARY's /etc/hosts, simulating the loss of DNS resolution for that node. 3. Leave the /etc/hosts records for the PRIMARY in place on the SECONDARY and ARBITER (simulating that DNS still works for them). Result: the PRIMARY will step down, but the SECONDARY will not run for election because the (now former) PRIMARY would veto. |
| Description |
|
If you end up with a one-way DNS partition, the PRIMARY will step down, but the SECONDARYs will not run for election because they believe the (former) PRIMARY will veto. Perhaps a node that cannot see a candidate should not veto, and should instead cast its vote when the actual election starts. |
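The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not MongoDB's election code: the `Node` class, the `reachable` sets, and the functions `sees_majority`, `would_veto`, and `will_stand` are all hypothetical names invented for illustration. It assumes a candidate declines to stand whenever a member it can reach would veto it, and that a member vetoes a candidate it cannot see.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Names of the members this node can resolve and reach (directed, so
    # A reaching B does not imply B reaching A).
    reachable: set = field(default_factory=set)
    votes_only: bool = False  # True for an arbiter, which never stands


def sees_majority(node, members):
    """A node counts itself plus every member it can reach."""
    visible = 1 + sum(
        1 for m in members if m.name != node.name and m.name in node.reachable
    )
    return visible * 2 > len(members)


def would_veto(voter, candidate, veto_if_unseen):
    """Assumed rule: a voter vetoes a candidate it cannot see."""
    return veto_if_unseen and candidate.name not in voter.reachable


def will_stand(candidate, members, veto_if_unseen=True):
    """The candidate stands only if no member it can reach would veto."""
    if candidate.votes_only or not sees_majority(candidate, members):
        return False
    for voter in members:
        if voter.name == candidate.name or voter.name not in candidate.reachable:
            continue  # members the candidate cannot reach are never consulted
        if would_veto(voter, candidate, veto_if_unseen):
            return False
    return True


# One-way DNS partition: the primary cannot resolve the other two members.
primary = Node("primary", reachable=set())
secondary = Node("secondary", reachable={"primary", "arbiter"})
arbiter = Node("arbiter", reachable={"primary", "secondary"}, votes_only=True)
partitioned = [primary, secondary, arbiter]

# Full outage: the primary is down, so nobody reaches it at all.
dead_primary = Node("primary", reachable=set())
secondary_b = Node("secondary", reachable={"arbiter"})
arbiter_b = Node("arbiter", reachable={"secondary"}, votes_only=True)
outage = [dead_primary, secondary_b, arbiter_b]
```

In this model the partitioned primary no longer sees a majority and steps down, while `will_stand(secondary, partitioned)` is false because the former primary would veto. With `veto_if_unseen=False` (the behavior suggested in the description) the secondary stands, as it also does in the full-outage case.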
| Comments |
| Comment by Matt Dannenberg [ 15/Oct/15 ] |
|
The new election protocol does not have this problem as nodes will not veto (or vote against) candidates based on whether or not they believe the candidate is alive. |
| Comment by Alexander Komyagin [ 12/Sep/13 ] |
|
After additional conversation with mattd@10gen.com, it looks like using priorities in the replica set can cause additional trouble here. Consider a slightly modified version of the original case:
Even if the (now former) PRIMARY did not veto, the ARBITER still would, since it can see the (now former) PRIMARY and the (now former) PRIMARY has a higher priority. I cannot see any obvious way to alleviate this issue. One thing that comes to mind is that each node could detect an asymmetric network split through the heartbeat (the same way we log "node ... thinks that we are down") and, if one is detected, not veto elections because of priority. -Alex |
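The idea in the comment above could look roughly like the following toy sketch. It is not MongoDB code: the `views` table (each member's reported up/down view of the others), the `priorities` mapping, and the functions `detects_asymmetric_split` and `priority_veto` are hypothetical, and the sketch assumes the voter has access to every member's view through heartbeat data.

```python
def detects_asymmetric_split(views, a, b):
    """True when exactly one of a and b reports the other as up."""
    return views[a].get(b, False) != views[b].get(a, False)


def priority_veto(voter, candidate, priorities, views, skip_on_asymmetry):
    """Assumed rule: veto if the voter sees a member that outranks the candidate.

    With skip_on_asymmetry=True, a higher-priority member is ignored when it
    and the candidate disagree about whether they can see each other.
    """
    for other, priority in priorities.items():
        if other in (voter, candidate):
            continue
        if not views[voter].get(other, False):
            continue  # the voter cannot see this member, so it is no reason to veto
        if priority > priorities[candidate]:
            if skip_on_asymmetry and detects_asymmetric_split(views, candidate, other):
                continue
            return True
    return False


# Hypothetical priorities for the modified case: the former primary outranks
# the secondary, and the arbiter cannot become primary.
priorities = {"primary": 2, "secondary": 1, "arbiter": 0}

# views[x][y] is True when x reports y as up. The former primary sees nobody.
views = {
    "primary": {"secondary": False, "arbiter": False},
    "secondary": {"primary": True, "arbiter": True},
    "arbiter": {"primary": True, "secondary": True},
}
```

Here `priority_veto("arbiter", "secondary", priorities, views, skip_on_asymmetry=False)` returns `True`, which is the situation the comment describes. With `skip_on_asymmetry=True` the arbiter no longer vetoes, because the secondary sees the former primary while the former primary does not see the secondary.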
| Comment by Matt Dannenberg [ 31/Jul/13 ] |
|
Linked the incorrect ticket previously. Good catch, Michael! |
| Comment by Michael Tewner [ 31/Jul/13 ] |
|
The original issue should probably be linked: |