  1. Core Server
  2. SERVER-10364

Split Brain when the Primary loses the majority of the cluster, but the cluster can still see it.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Works as Designed
    • Affects Version/s: 2.4.5
    • Fix Version/s: None
    • Component/s: Replication
    • Labels:
      None
    • Environment:
      ProofOfConcept: Node#1 - Primary, Node#2 - Secondary and Arbiter
    • Operating System:
      ALL
    • Steps To Reproduce:
      I was able to reproduce the situation where:
      (1) The Primary is UNABLE to see the Secondary and the Arbiter
      (2) the Secondary and Arbiter ARE able to see the Primary

      I simulated this by putting the Secondary and the Arbiter on one server, and the Primary on a different server.
      I tested with our current production version (2.0.7) and also with 2.2.5 and 2.4.5.
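
The one-way visibility in steps (1) and (2) can be sketched as a small directed-reachability model. This is only a toy illustration, not MongoDB code: the member labels, the `CAN_SEE` set, and both helper functions are hypothetical names introduced here, and the majority rule is the usual "more than half of the voting members" assumption.

```python
# Toy model of the one-way partition from the reproduction steps; not MongoDB code.
MEMBERS = ["primary", "secondary", "arbiter"]

# (a, b) in CAN_SEE means member a can reach member b.
# The Primary appears only as a target: it can be seen but sees nobody.
CAN_SEE = {
    ("secondary", "primary"),
    ("secondary", "arbiter"),
    ("arbiter", "primary"),
    ("arbiter", "secondary"),
}


def visible_votes(member):
    """Count the member itself plus every other member it can reach."""
    return 1 + sum((member, other) in CAN_SEE for other in MEMBERS if other != member)


def sees_majority(member):
    """True when the member can reach a strict majority of the set."""
    return visible_votes(member) > len(MEMBERS) // 2


if __name__ == "__main__":
    for m in MEMBERS:
        print(m, visible_votes(m), sees_majority(m))
```

In this model the Primary counts only its own vote, so it no longer sees a majority, while the Secondary and the Arbiter each still see all three members.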


      Description

      Last week, an AWS DNS resolution failure left one Amazon Availability Zone unable to resolve DNS. Other AZs WERE able to resolve DNS, including records of hosts in the "DNS-failed" zone.

      In a nutshell, the following situation left both nodes in the "SECONDARY" state:

      PRIMARY (db01srv02) - suddenly can't see the SECONDARY or the ARBITER. It steps down.
      SECONDARY (db01srv01) - CAN see the Primary and the Arbiter. It refuses to elect itself because "db01srv02.local.:20001 would veto"

      (N.B.: after upgrading to 2.4.5, I now get the more descriptive error "Sun Jul 28 12:43:36 [rsMgr] not electing self, db01srv02.local.:20001 would veto with 'I don't think db01srv01.local.:10001 is electable'".)
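
The deadlock described above can be sketched in a few lines. This is a simplified guess at the behavior, inferred only from the log message, and not the actual MongoDB election code: the rule "a member vetoes any candidate it cannot reach itself" is an assumption, `arbiter` is a placeholder label because the ticket does not give the Arbiter's host name, and `veto_reason` and `try_elect_self` are hypothetical helpers.

```python
# Toy model of the veto seen in the log; the veto rule is an assumption.
MEMBERS = ["db01srv01", "db01srv02", "arbiter"]

# (a, b) in CAN_SEE means member a can reach member b.
# db01srv02 (the stepped-down Primary) can be reached but reaches nobody.
CAN_SEE = {
    ("db01srv01", "db01srv02"),
    ("db01srv01", "arbiter"),
    ("arbiter", "db01srv01"),
    ("arbiter", "db01srv02"),
}


def veto_reason(voter, candidate):
    """Assumed rule: a voter vetoes a candidate it cannot reach itself."""
    if (voter, candidate) not in CAN_SEE:
        return f"I don't think {candidate} is electable"
    return None


def try_elect_self(candidate):
    """Return (elected, vetoes) after polling every member the candidate can reach."""
    vetoes = []
    for voter in MEMBERS:
        if voter == candidate or (candidate, voter) not in CAN_SEE:
            continue  # skip self and members the candidate cannot reach
        reason = veto_reason(voter, candidate)
        if reason:
            vetoes.append(f"{voter} would veto with '{reason}'")
    return (not vetoes, vetoes)


if __name__ == "__main__":
    print(try_elect_self("db01srv01"))
```

Under these assumptions the Secondary (db01srv01) can still reach the old Primary, so it collects that node's veto and never elects itself, which matches the state where both nodes end up as SECONDARY.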

      Disclaimer - I'm not a DB Expert, so this may be expected behavior for some reason....

        Attachments

        1. mongoDBSplitBrainLog.txt
          16 kB
          Michael Tewner
        2. mongoDBSplitBrainLog-2.4.5.txt
          17 kB
          Michael Tewner

              People

              Assignee:
              matt.dannenberg Matt Dannenberg
              Reporter:
              tewner Michael Tewner
              Participants:
              Votes:
              1
              Watchers:
              7

                Dates

                Created:
                Updated:
                Resolved: