  1. Core Server
  2. SERVER-8483

reconfig may cause problem re-electing primary

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Replication
    • Labels:
    • Operating System: ALL

      Description

      The setup is:

      A replica set with 4 nodes (A, B, C, D), all with priority 0 except the first node, A, so only A can become primary.

      Nodes B and C are slaveDelayed by either 0 or 40s, alternating via reconfigs.

      Node D is blackholed from node A, symmetrically (A can't talk to D, and D can't talk to A).
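
As a rough illustration of the setup above, the alternating configs could be generated as follows. This is a sketch and not the attached sync_change_source.js: the host names, the set name `testSet`, and the helpers `buildConfig` and `nextConfig` are hypothetical, and it assumes the delay is expressed through the `slaveDelay` member field.

```javascript
// Hypothetical host names; the ticket does not list the member addresses.
const HOSTS = { A: "hostA:31000", B: "hostB:31001", C: "hostC:31002", D: "hostD:31003" };

// Build a replica set config document in which only node A is electable
// and the named node (B or C) is delayed by `delaySecs` seconds.
function buildConfig(version, delayedNode, delaySecs = 40) {
  const members = Object.keys(HOSTS).map((name, i) => {
    const member = { _id: i, host: HOSTS[name], priority: name === "A" ? 1 : 0 };
    if (name === delayedNode) {
      member.slaveDelay = delaySecs; // a delayed member has priority 0 here
    }
    return member;
  });
  return { _id: "testSet", version, members }; // "testSet" is an assumed name
}

// Swap the delayed node between B and C and bump the config version.
function nextConfig(prev) {
  const wasB = prev.members[1].slaveDelay > 0;
  return buildConfig(prev.version + 1, wasB ? "C" : "B");
}
```

In a mongo shell connected to node A, a document of this shape would be passed to `rs.reconfig(cfg)` for each alternation.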

      At first, node D correctly switches its sync source between nodes B and C, depending on which is delayed. Each time a reconfig happens, node A steps down to secondary and is then re-elected primary.

      At some point, though, it seems to become impossible for node A to be re-elected primary after a reconfig. Node A's log contains a strange message:

       m31000| Thu Jan 17 17:05:00.147 [rsMgr] not electing self, 127.0.0.4:31002 would veto with '127.0.0.2:31000 is trying to elect itself but 127.0.0.2:31000 is already primary and more up-to-date'
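
To find this condition in the attached logs, a line like the one above could be parsed as sketched below. The regular expression is written against this one quoted line only, and the helper `parseVeto` and its `selfVeto` flag are hypothetical names; `selfVeto` marks the odd case where the candidate is vetoed on the grounds that the candidate itself is already primary.

```javascript
// Matches the "not electing self" line format quoted in this ticket.
const VETO_RE =
  /not electing self, (\S+) would veto with '(\S+) is trying to elect itself but (\S+) is already primary and more up-to-date'/;

// Extract the vetoing host, the candidate, and the host the veto claims
// is already primary. Returns null for lines that do not match.
function parseVeto(line) {
  const m = VETO_RE.exec(line);
  if (!m) return null;
  const [, vetoer, candidate, claimedPrimary] = m;
  return {
    vetoer,
    candidate,
    claimedPrimary,
    // true when the candidate is blocked because of its own primary state
    selfVeto: candidate === claimedPrimary,
  };
}
```

Applied to the quoted line, this yields vetoer `127.0.0.4:31002`, with `127.0.0.2:31000` as both the candidate and the claimed primary.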

      A test to reproduce the problem and the output from two runs are attached below (with replSetStatus from all nodes every 5s during the problem period).

        Attachments

        1. currentTest_failure_same_host_veto.txt
          177 kB
        2. currentTest.txt
          569 kB
        3. sync_change_source.js
          3 kB
