Core Server / SERVER-13070

Including arbiters when calculating replica set majority can break balancing / prevents fault-tolerant majority writes


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Gone away
    • Affects Version/s: 2.4.9
    • Fix Version/s: None
    • Component/s: Replication
    • Operating System: ALL

    Description

      It seems like the change in replica set majority calculation introduced in SERVER-5351 broke balancing on some existing cluster setups, since it bases the strict majority on the total number of members, not the number of non-arbiter ones.

      We recently upgraded a cluster from v2.2.4 to v2.4.9, and lost our ability to balance the cluster in its original setup.

      The cluster has 20 shards, and each shard is a replica set with four members: a primary, a secondary and an arbiter in one datacenter, and a non-voting, zero-priority, hidden secondary with a 12-hour replication delay in another datacenter.

      After the upgrade, balancing the cluster failed because it was waiting for operations to replicate to a majority of all replica set members (3 out of 4), rather than a majority of the non-arbiter members (2 out of 3). With the third non-arbiter member on a 12-hour delay, that didn't go very well. I expect the same would happen on individual shards if either of the data-bearing members in the primary datacenter became unavailable.
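      The difference between the two behaviours can be sketched as follows. This is an illustrative calculation, not MongoDB's actual implementation; the `majority` helper and member names are hypothetical.

```python
def majority(member_count: int) -> int:
    """Strict majority: more than half of the given count."""
    return member_count // 2 + 1

# Reporter's per-shard topology: primary, secondary, and arbiter in one
# datacenter, plus a hidden, delayed secondary in another datacenter.
members = ["primary", "secondary", "arbiter", "delayed-secondary"]
non_arbiters = [m for m in members if m != "arbiter"]

# Post-SERVER-5351 behaviour (v2.4.9): majority counted over ALL members,
# so the 12-hour-delayed secondary must acknowledge the write.
print(majority(len(members)))       # 3 of 4

# Pre-upgrade behaviour (v2.2.4): majority over non-arbiter members only,
# so the primary and local secondary suffice.
print(majority(len(non_arbiters)))  # 2 of 3
```

With a 12-hour replication delay on the fourth member, a 3-of-4 requirement effectively stalls every balancing round, while 2-of-3 completes as soon as the local secondary catches up.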

      (As a temporary fix to get the balancing going again, we removed the replication delay to the off-site secondary.)

      Not sure if this is the same issue as SERVER-12386, or just related to it.

      Attachments

        Issue Links

          Activity

            People

              Assignee: Backlog - Replication Team (backlog-server-repl)
              Reporter: Filip Salomonsson (filip@pingdom.com)
              Votes: 1
              Watchers: 12

              Dates

                Created:
                Updated:
                Resolved: