[SERVER-15254] replica set members should be able to replicate off members that don't build indexes Created: 13/Sep/14  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Zardosht Kasheff
Assignee: Backlog - Replication Team
Resolution: Unresolved
Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Replication
Participants:

 Description   

Currently, if member 'A' does not build indexes, then other members that do build indexes cannot replicate oplog data off of A. Here is why this is problematic.

Suppose member 'A' finds itself further ahead than all other members, because it was the only member to replicate some data off the primary before the primary disappeared. Because other members cannot sync from A, they cannot catch up to A. Also, because A is ahead of everyone else, A will veto every possible election of a new primary. You are stuck in a situation where no primary can be elected.
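The deadlock described above can be reproduced with a small toy model. This is only a sketch, not server code: `Node`, `can_sync_from`, `catch_up`, `electable_candidates`, and the integer `optime` are hypothetical names invented for illustration, and the model assumes that A cannot be elected itself and that a candidate is vetoed whenever any member is ahead of it.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    optime: int                 # hypothetical: position in the oplog
    build_indexes: bool = True
    electable: bool = True      # assumption: a buildIndexes=false member cannot be elected


def can_sync_from(target, source, allow_oplog_from_no_index):
    """A source is useful only if it is ahead; the flag models the proposed change."""
    if source.optime <= target.optime:
        return False
    return source.build_indexes or allow_oplog_from_no_index


def catch_up(nodes, allow_oplog_from_no_index):
    """Let every node sync repeatedly until no node can make progress."""
    changed = True
    while changed:
        changed = False
        for target in nodes:
            sources = [s for s in nodes
                       if s is not target
                       and can_sync_from(target, s, allow_oplog_from_no_index)]
            if sources:
                target.optime = max(s.optime for s in sources)
                changed = True


def electable_candidates(nodes):
    """Assumed veto rule: any member that is ahead of a candidate vetoes it."""
    newest = max(n.optime for n in nodes)
    return [n.name for n in nodes if n.electable and n.optime >= newest]


def make_set():
    # A replicated one extra operation before the primary disappeared.
    return [
        Node("A", optime=10, build_indexes=False, electable=False),
        Node("B", optime=9),
        Node("C", optime=9),
    ]


stuck = make_set()
catch_up(stuck, allow_oplog_from_no_index=False)
print(electable_candidates(stuck))   # no candidate survives A's veto

fixed = make_set()
catch_up(fixed, allow_oplog_from_no_index=True)
print(electable_candidates(fixed))   # B and C caught up to A
```

Under the current rule, B and C never reach A's position, so every candidate is vetoed. Once oplog sync from A is allowed in the model, B and C catch up and either can be elected.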

Perhaps Member::syncable() should distinguish between replicating oplog data and doing an initial sync. It makes sense for members that don't build indexes to be ineligible as the source of an initial sync. However, for the reasons above, they should be allowed to be the source of oplog data during normal replication.
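One way to picture the proposed distinction is sketched below in Python rather than in the server's own code. The `SyncPurpose` enum, the `Member` fields, and the health and readability checks are all assumptions made for illustration; the actual Member::syncable() may check different conditions.

```python
from dataclasses import dataclass
from enum import Enum


class SyncPurpose(Enum):
    INITIAL_SYNC = "initial_sync"   # full copy of the data, including indexes
    OPLOG = "oplog"                 # steady-state replication of oplog entries


@dataclass
class Member:
    name: str
    build_indexes: bool = True
    healthy: bool = True    # hypothetical: member is reachable
    readable: bool = True   # hypothetical: member is in a state that serves data

    def syncable(self, purpose: SyncPurpose) -> bool:
        """Return True if this member may be used as a sync source for `purpose`."""
        if not (self.healthy and self.readable):
            return False
        if purpose is SyncPurpose.INITIAL_SYNC:
            # A member without indexes cannot seed a full copy of the data.
            return self.build_indexes
        # Oplog entries do not depend on which indexes the source has built.
        return True


a = Member("A", build_indexes=False)
print(a.syncable(SyncPurpose.INITIAL_SYNC))  # False
print(a.syncable(SyncPurpose.OPLOG))         # True
```

The only behavioral change in this sketch is that the buildIndexes check applies to initial sync alone; every other eligibility condition is shared by both purposes.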



 Comments   
Comment by Charlie Page [ 18/Sep/14 ]

The root of the issue is that a node which cannot be elected can acknowledge a write concern. Only electable, data-bearing nodes should be part of a write concern: for the write concern to be meaningful, a node that acknowledges a write must be able to fully represent that write in a consensus operation.

w:majority can still fail if the majority of the partition contains only unelectable servers that are the most current. A single buildIndexes = false node can prevent a primary from being elected or force rollbacks (depending on which decision is made) if no electable node is at least as current.

Comment by Zardosht Kasheff [ 15/Sep/14 ]

Hello Eric,
Technically, the issue would not be solved, because a majority of servers may have buildIndexes set to false and have some write that others do not. Realistically, that is not likely.

But I think removing veto powers is problematic. Yes, majority write concern would still work, but practically any other replica set write concern that users may want will have problems. Suppose a write is written with REPLICA_SAFE. The user expects that as long as new elections involve all secondaries (and only the old primary does not participate for some strange reason), the write survives. Without veto power, the only secondary to have acknowledged a write may not be able to stop the election of a member that does not have the write. A similar example is having three data centers with three members each, and using a write concern that states "make sure the write makes it to two out of three data centers". Assuming two data centers (6 members) participate in an election, and only one member has the write, an election may roll back the write.
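The data center example can be checked with a simplified model. This is a sketch under stated assumptions, not a description of either election protocol as implemented: the member names, the two functions, and the rule that the old primary sits in the third data center are all invented for illustration. The "Raft-style" rule assumed here is that a voter grants its vote only to a candidate at least as up to date as itself; the "veto" rule assumed here is that any participant ahead of the candidate blocks it.

```python
def raft_style_winners(participants, has_write, total_members):
    """Candidates that could gather a majority when each voter only votes
    for a candidate at least as up to date as itself (no veto)."""
    majority = total_members // 2 + 1
    winners = []
    for candidate in participants:
        votes = sum(
            1 for voter in participants
            if candidate in has_write or voter not in has_write
        )
        if votes >= majority:
            winners.append(candidate)
    return winners


def veto_style_winners(participants, has_write, total_members):
    """Candidates that survive when any participant that is ahead vetoes."""
    majority = total_members // 2 + 1
    if len(participants) < majority:
        return []
    someone_ahead = any(p in has_write for p in participants)
    return [c for c in participants if c in has_write or not someone_ahead]


# Three data centers with three members each; the old primary (dc3-a) is gone.
participants = ["dc1-a", "dc1-b", "dc1-c", "dc2-a", "dc2-b", "dc2-c"]
has_write = {"dc3-a", "dc1-a"}   # the write reached two of three data centers

no_veto = raft_style_winners(participants, has_write, total_members=9)
with_veto = veto_style_winners(participants, has_write, total_members=9)

print([m for m in no_veto if m not in has_write])  # members that could win and roll back the write
print(with_veto)                                   # only the member holding the write
```

In this model, five of the six participants can reach the majority of five votes without holding the write, which is the rollback scenario described above; with the veto, only dc1-a can win.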

This issue is related to SERVER-14885 in that both have the same underlying problem: if member 'A' can block member 'B' from becoming primary, then either A or B ought to have the power to eventually become primary, likely by being able to sync off the other. I realize arbiters throw a wrinkle in that statement, but that wrinkle only exists if there is some existing network partition that does not get resolved.

Comment by Eric Milkie [ 15/Sep/14 ]

Hi Zardosht.
If one were to swap out the current election mechanism with one that adhered to the Rules for Servers as presented in the Raft paper (which means no veto powers), would that solve this issue?

Comment by Zardosht Kasheff [ 13/Sep/14 ]

A note I forgot to add. I did not read the code to see how rollback would be affected, but there may be some subtlety there.
