[SERVER-14539] Full consensus arbiter (i.e. uses an oplog) Created: 12/Jul/14  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: New Feature
Priority: Major - P3
Reporter: Charlie Page
Assignee: Backlog - Replication Team
Resolution: Unresolved
Votes: 5
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-20820 Arbiter with instant replay Backlog
is related to SERVER-26717 PSA flapping during netsplit when usi... Closed
is related to SERVER-18453 Avoiding Rollbacks in new Raft based ... Closed
Assigned Teams:
Replication

 Description   

Allow arbiters to fully participate in consensus by maintaining an oplog (as in Viewstamped Replication), e.g. --ArbiterWithOplog=<size in MB of oplog>
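If implemented as proposed, the option might be passed at mongod startup like this (the flag exists only in this ticket's proposal; it is not a real mongod option today):

```shell
# Hypothetical, per this ticket's proposal: start an arbiter that keeps
# a 1024 MB oplog so it can participate fully in consensus.
mongod --replSet rs0 --port 27019 --ArbiterWithOplog=1024
```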

This would mean that w:majority works as expected even when arbiters are counted: no rollbacks, assuming no node falls off the oplog. In particular, it would prevent the rollbacks that currently occur when the network flaps and an arbiter is in use.

This would also mean that a primary may not be electable, even with a majority of nodes up, until one of the data-bearing nodes has replicated the arbiter's oplog.

Let's say your replica set configuration calls for N data-bearing nodes and M arbiters.

Consider the case where M and N are both positive. If exactly ceil(N/2) data nodes go offline but no arbiters do, you'll still have a primary, but no w:majority writes can be acknowledged. Therefore, any write accepted by a replica set in such a degraded state is subject to eventual rollback.
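The quorum arithmetic above can be sketched in Python (the function and its return convention are illustrative, not MongoDB code):

```python
def replset_state(n_data, n_arbiters, data_offline):
    """Return (can_elect_primary, can_ack_w_majority) for a replica
    set with n_data data-bearing voters and n_arbiters arbiters,
    when data_offline data nodes are down and all arbiters are up."""
    voters = n_data + n_arbiters
    majority = voters // 2 + 1
    data_up = n_data - data_offline
    # Arbiters vote, so they help elect a primary...
    can_elect_primary = (data_up + n_arbiters) >= majority
    # ...but they have no oplog and never acknowledge writes, so
    # w:majority needs `majority` acks from data-bearing nodes alone.
    can_ack_w_majority = data_up >= majority
    return can_elect_primary, can_ack_w_majority

# PSA set (N=2, M=1) with one data node down: a primary survives,
# but w:majority writes can never be acknowledged.
print(replset_state(2, 1, 1))  # → (True, False)
```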

Observe that the common, minimally expensive mode of operation for replica sets is the above with N=2 and M=1. If either data node is offline, no writes are rollback-proof (though, barring a second failure, none of them will actually be rolled back). Further, no write concern stronger than w:1 can be satisfied for the application.

At present, every replica set with at least one arbiter has such a pathological mode of operation once sufficiently many data nodes fail.

This leaves the operator with three choices in a three-node set with an arbiter (the same choices exist in larger sets; they are just easier to describe with a concrete example):
1) w:1, and accept rollbacks (i.e. silently lose data; it is currently possible to lose an arbitrary amount of data via repeated rollbacks)
2) w:2, and accept that the system goes down with a single node failure
3) Monitor rs.status() and dynamically change the write concern before every write to try to get the best of both worlds

This would create choice 4:
4) w:2, knowing that committed writes won't be rolled back and that the loss of a single node won't take down the set (limited by the oplog window, but the oplog can be sized to cover, say, a week)

For completeness, the three write-concern choices in a three-node set, which map onto larger replica sets:
w:1, less than majority.
w:2, majority.
w:3, greater than majority.
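That mapping can be stated as a small helper over the voting-member count (the function name and string labels are illustrative):

```python
def classify_write_concern(w, n_voters=3):
    """Classify a numeric write concern relative to the voting
    majority of an n_voters-member replica set."""
    majority = n_voters // 2 + 1
    if w < majority:
        return "less than majority"
    if w == majority:
        return "majority"
    return "greater than majority"

# The three-node example from above:
print(classify_write_concern(1), classify_write_concern(2),
      classify_write_concern(3))
# → less than majority majority greater than majority
```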



 Comments   
Comment by Jason R. Coombs [ 10/May/16 ]

This feature would also serve another purpose: allowing the creation of nodes with unusually large oplogs for initializing new replicas when the current oplog size on members of the set is too small and is overrun before a new member can finish syncing. Such a feature would save us dozens of hours every year and make our replica sets much easier to manage.

Generated at Thu Feb 08 03:35:11 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.