[SERVER-32392] Re-evaluate sync sources if sync source steps down Created: 18/Dec/17  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Judah Schvimer Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Replication
Participants:

 Description   

Consider a 3 nodes replica set. B is primary and A and C are syncing from B. A is at optime (1,1), B is at optime (5,1), C is at optime (3,1). If there is a network partition between B and C, C will run for election, A will vote for it, C will win its election, and it will write (4,2). Eventually B will time out and step down. A does not stop syncing from B when B steps down, and will replicate all the way to (5,1). When the partition lifts, or A catches up, they will both start syncing from C and go into rollback. B is no longer a primary when A catches up, and A propagates C's term, so there’s no danger of committing writes that will roll back. This is probably not ideal though since A did not need to roll back, and A should probably switch sync sources much sooner.



 Comments   
Comment by Alex [ 20/Nov/18 ]

@Judah Thank you for taking the time to clarify this. Everything is clear now. 

Comment by Judah Schvimer [ 13/Nov/18 ]

Hi alex101,

Thank you for your question. This gets at the distinction between replicating to a majority of servers and acknowledging a write/serving a read as being "majority committed". We guarantee that if we respond "success" to a "w:majority" write or return data in a "readConcern: majority" read, that the data will not be rolled back. We do not, however, guarantee that if data arrives at a majority of nodes that it will not roll back. This is related to "Figure 8" in the raft paper (though slightly different in this exact case). In the above case, although a majority of nodes have replicated (4,1) and (5,1), we do not acknowledge them or return them in majority reads until the replication majority commit point advances to (4,1) or (5,1). The majority commit point is only advanced on primaries, and as soon as node A votes for node C, node A will only transmit progress updates with its term, causing B to step down before majority committing any new writes.

Does this answer the question?
Judah

Comment by Alex [ 13/Nov/18 ]

I'm wondering how a rollback happens even though (4,1) and (5,1) were written to a majority of nodes (A, B). Does not this lead to data loss (a user might read from a majority before the rollback happens)?

Comment by Judah Schvimer [ 19/Dec/17 ]

I agree. I'm not sure what (if anything) we want to do here, however this behavior leads to unnecessary rollbacks, which is generally not ideal (especially since rollback may start to take longer in 3.8), so I think it's worth brainstorming improvements here.

Comment by Eric Milkie [ 19/Dec/17 ]

Is this work only for optimizing the case where there is a partial network partition? I can see this change making some situations less optimal as well, as it isn't free to re-evaluate sync sources. I think we should have a discussion on whether the effort to do this is worth it.

Comment by Spencer Brody (Inactive) [ 18/Dec/17 ]

We could probably do this by clearing our sync source if our oplog tailing query gets killed with "InterruptedDueToReplStateChange"

Generated at Thu Feb 08 04:30:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.