Core Server / SERVER-8483

reconfig may cause problem re-electing primary

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Replication
    • ALL

      Setup is this:

      A replica set with 4 nodes (A, B, C, D). All nodes have priority 0 except the first node, A, so only A can become primary.

      Nodes B and C are slaveDelayed by either 0 or 40s, alternating via reconfigs.

      Node D is blackholed from node A symmetrically (A can't talk to D, and D can't talk to A).
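
For illustration, the configuration used in each reconfig might look like the sketch below. This is not taken from the attached test: the set name `testSet`, the helper `buildConfig`, and the hosts for B and D are hypothetical (only the hosts for A and C appear in the log line in this report), and it assumes that exactly one of B and C is delayed at a time.

```javascript
// Hypothetical host list. 127.0.0.2:31000 (A) and 127.0.0.4:31002 (C) appear
// in the log line in this report; the hosts for B and D are assumed.
const HOSTS = {
  A: "127.0.0.2:31000",
  B: "127.0.0.3:31001",
  C: "127.0.0.4:31002",
  D: "127.0.0.5:31003",
};

// Build a replica set config document in which only A is electable and
// exactly one of B or C has a 40 second slaveDelay (the other has 0).
function buildConfig(version, delayedNode) {
  if (delayedNode !== "B" && delayedNode !== "C") {
    throw new Error("delayedNode must be 'B' or 'C'");
  }
  const members = Object.keys(HOSTS).map((name, i) => {
    const member = {
      _id: i,
      host: HOSTS[name],
      // Priority 0 everywhere except A, so only A can become primary.
      priority: name === "A" ? 1 : 0,
    };
    if (name === "B" || name === "C") {
      member.slaveDelay = name === delayedNode ? 40 : 0;
    }
    return member;
  });
  // "testSet" is an assumed set name; version must increase on every reconfig.
  return { _id: "testSet", version: version, members: members };
}
```

From a shell connected to the primary, a reconfig that moves the delay to C could then be issued as `rs.reconfig(buildConfig(rs.conf().version + 1, "C"))`. The blackhole between A and D is a network-level condition and is not expressed in the config.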

      At first, node D correctly switches its sync source between nodes B and C, depending on which is delayed. Each time a reconfig happens, node A drops to secondary and is then re-elected primary.

      At some point, though, it seems to become impossible for node A to be elected primary again after a reconfig. There is a strange message in node A's logs:

       m31000| Thu Jan 17 17:05:00.147 [rsMgr] not electing self, 127.0.0.4:31002 would veto with '127.0.0.2:31000 is trying to elect itself but 127.0.0.2:31000 is already primary and more up-to-date'
      

      A test to reproduce the problem and the output from two runs are attached below (with replSetStatus from all nodes every 5s during the problem period).

        1. currentTest_failure_same_host_veto.txt
          177 kB
        2. currentTest.txt
          569 kB
        3. sync_change_source.js
          3 kB

            Assignee:
            davide.italiano Davide Italiano
            Reporter:
            greg_10gen Greg Studer
            Votes:
            0
            Watchers:
            3
