[SERVER-13070] Including arbiters when calculating replica set majority can break balancing / prevents fault-tolerant majority writes Created: 06/Mar/14  Updated: 07/Apr/23  Resolved: 21/Dec/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Filip Salomonsson Assignee: Backlog - Replication Team
Resolution: Done Votes: 1
Labels: ElecENH, majority
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-12386 Use of arbiters prevents fault-tolera... Closed
Related
related to SERVER-15764 unit test new majority write behavior... Closed
is related to SERVER-14403 Change w:majority write concern to in... Closed
is related to SERVER-7681 Report majority number in ReplSetGetS... Closed
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

It seems like the change in replica set majority calculation introduced in SERVER-5351 broke balancing on some existing cluster setups, since the new calculation bases the strict majority on the total number of members rather than on the number of non-arbiter members.

We recently upgraded a cluster from v2.2.4 to v2.4.9, and lost our ability to balance the cluster in its original setup.

The cluster has 20 shards, and each shard is a replica set with four members: a primary, a secondary and an arbiter in one datacenter, and a non-voting, zero-priority, hidden secondary with a 12-hour replication delay in another datacenter.
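For reference, a hypothetical replica set configuration matching that topology (hostnames are invented for illustration; in the 2.4 config format, slaveDelay is given in seconds and a delayed member must also have priority 0):

    # Hypothetical config for one shard's replica set; hostnames made up.
    config = {
        "_id": "shard0",
        "members": [
            {"_id": 0, "host": "dc1-a:27017"},                       # primary
            {"_id": 1, "host": "dc1-b:27017"},                       # secondary
            {"_id": 2, "host": "dc1-c:27017", "arbiterOnly": True},  # arbiter
            {"_id": 3, "host": "dc2-a:27017",                        # off-site
             "priority": 0, "hidden": True, "votes": 0,
             "slaveDelay": 12 * 60 * 60},                            # 12h delay
        ],
    }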

After the upgrade, balancing failed because each migration waited for its operations to replicate to a majority of all replica set members (3 out of 4) rather than to a majority of the non-arbiter members (2 out of 3). With the third non-arbiter member on a 12-hour replication delay, that wait could not complete. I expect the same would happen on individual shards if either of the two undelayed data-bearing members became unavailable.
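For concreteness, a minimal Python sketch (illustrative only, not MongoDB source code) of the two majority calculations being compared:

    def strict_majority(n):
        """Smallest number of members that is a strict majority of n."""
        return n // 2 + 1

    total_members = 4   # primary + secondary + arbiter + delayed hidden
    non_arbiters  = 3   # data-bearing members only

    print(strict_majority(total_members))  # 3 -- post-upgrade (2.4.9) behavior
    print(strict_majority(non_arbiters))   # 2 -- pre-upgrade (2.2.4) behavior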

(As a temporary fix to get balancing going again, we removed the replication delay on the off-site secondary.)
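A sketch of that kind of workaround, assuming a pymongo connection to the primary (connection details invented for illustration):

    # Sketch: reconfigure the set to drop the 12-hour delay on the
    # off-site member. Assumes a pymongo connection to the primary.
    from pymongo import MongoClient

    client = MongoClient("dc1-a", 27017)
    config = client.local.system.replset.find_one()  # current replset config

    for member in config["members"]:
        member.pop("slaveDelay", None)  # remove the replication delay

    config["version"] += 1  # a reconfig requires a bumped version
    client.admin.command("replSetReconfig", config)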

Not sure if this is the same issue as SERVER-12386, or just related to it.



 Comments   
Comment by Eric Milkie [ 21/Dec/15 ]

Only voting nodes are now counted toward the majority, and only voting nodes can satisfy it.

Comment by Asya Kamsky [ 21/Sep/15 ]

I believe this problem, in the form described, would not exist in 3.0, since only voting nodes are counted toward the majority. In the given example the delayed hidden node is non-voting, so the majority would be 2, which the two data-holding, non-hidden nodes can satisfy.
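A minimal sketch of that voting-only calculation, using the topology from the description:

    # Sketch of the 3.0-style calculation: only voting members are counted.
    members = [
        {"name": "primary",   "votes": 1},
        {"name": "secondary", "votes": 1},
        {"name": "arbiter",   "votes": 1},
        {"name": "delayed",   "votes": 0},  # hidden, 12h-delayed, non-voting
    ]
    voting = sum(1 for m in members if m["votes"] > 0)
    print(voting // 2 + 1)  # 2 -- the two data-bearing voters can satisfy this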

Comment by Eric Milkie [ 12/Mar/14 ]

The goal is to do a write that will not be rolled back. An arbiter can vote but does not take writes, which means it can constitute part of a majority for election purposes but cannot help prevent rollbacks as part of a write majority.

Comment by Filip Salomonsson [ 12/Mar/14 ]

Even if we disregard the delay and the non-voting nodes, though: what is the reason for counting the arbiter(s) when the goal is writes, not votes?

Comment by Eric Milkie [ 11/Mar/14 ]

The previous calculation used by the migration code was incorrect: it could have resulted in writes being rolled back if part of the replica set failed. The new calculation ensures that migrations cannot be rolled back. Unfortunately, this means that in your case you need another data-bearing node to satisfy the majority.
In the future, we may consider excluding delayed nodes with 0 votes from the majority calculation, since such nodes cannot influence an election.
