- Type: Bug
- Resolution: Incomplete
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: Replication
- ALL
The setup is as follows:
A replica set with 4 nodes, all with priority 0 except the first node, A (so only A can become primary).
Nodes B and C have slaveDelay set to 0 or 40s, alternating via reconfigs.
Node D is blackholed from node A symmetrically (A can't talk to D, and D can't talk to A).
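For readers who want the setup in concrete form, here is a minimal sketch of the config documents the reconfigs would alternate between. It is not the attached test. The set name `testSet` and the addresses for B and D are made up for illustration; A's address comes from the log line in this report, and C's is inferred from the port ordering. It also assumes exactly one of B or C is delayed at a time, and it does not model the blackhole between A and D.

```python
# Hypothetical host map. Only A's address is stated in the report's log line;
# C's is inferred from port order, B's and D's are assumptions.
HOSTS = {
    "A": "127.0.0.2:31000",
    "B": "127.0.0.3:31001",  # assumed
    "C": "127.0.0.4:31002",  # inferred
    "D": "127.0.0.5:31003",  # assumed
}


def build_config(version, delayed):
    """Return a replica set config in which `delayed` ("B" or "C") lags by 40s."""
    if delayed not in ("B", "C"):
        raise ValueError("only B or C can be the delayed node")
    members = []
    for index, name in enumerate("ABCD"):
        # Only A is electable; every other node has priority 0.
        member = {
            "_id": index,
            "host": HOSTS[name],
            "priority": 1 if name == "A" else 0,
        }
        if name in ("B", "C"):
            member["slaveDelay"] = 40 if name == delayed else 0
        members.append(member)
    # "testSet" is a placeholder set name, not taken from the report.
    return {"_id": "testSet", "version": version, "members": members}


def alternate(rounds, start_version=1):
    """Yield successive configs, flipping the delayed node between B and C."""
    for step in range(rounds):
        yield build_config(start_version + step, "BC"[step % 2])
```

Each document yielded by `alternate` would be what a single reconfig applies; the version increases by one per round, as a reconfig requires.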
At first, node D correctly switches its sync source between nodes A and B, depending on which is delayed. Each time a reconfig happens, node A drops to secondary and is then re-elected primary.
At some point, though, it becomes impossible for node A to become primary again after a reconfig. There is a strange message in node A's logs:
m31000| Thu Jan 17 17:05:00.147 [rsMgr] not electing self, 127.0.0.4:31002 would veto with '127.0.0.2:31000 is trying to elect itself but 127.0.0.2:31000 is already primary and more up-to-date'
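One possible reading of why this message is strange, sketched as a toy model: the candidate and the "already primary" node are the same host, which would happen if the vetoing node still held a stale view of A as primary. The rule below is an assumption made for illustration. This report does not show the server's veto logic, and `MemberView` and `veto_reason` are hypothetical names.

```python
from typing import NamedTuple, Optional


class MemberView(NamedTuple):
    """What a voter believes about a member, e.g. from its last heartbeat."""
    host: str
    is_primary: bool
    optime: int  # stand-in for the member's last applied oplog timestamp


def veto_reason(candidate: MemberView,
                known_primary: Optional[MemberView]) -> Optional[str]:
    """Return a veto message, or None if this hypothetical voter has no objection.

    Assumed rule: veto when the voter believes some node is primary and that
    node is at least as up to date as the candidate.
    """
    if known_primary is not None and known_primary.optime >= candidate.optime:
        return (f"{candidate.host} is trying to elect itself but "
                f"{known_primary.host} is already primary and more up-to-date")
    return None


# A has stepped down after the reconfig, but the voter's view of it is stale.
stale_view_of_a = MemberView("127.0.0.2:31000", True, 100)
message = veto_reason(stale_view_of_a, stale_view_of_a)
```

Under this assumed rule, the veto would persist for as long as the voter's view of A stays stale, which would match the symptom of A never being re-elected. Whether that is what actually happens here is not established by the report.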
A test to reproduce the problem and the output from two runs are attached below (with replSetStatus from all nodes every 5s during the problem period).