[SERVER-32392] Re-evaluate sync sources if sync source steps down Created: 18/Dec/17 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Judah Schvimer | Assignee: | Backlog - Replication Team |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||
| Assigned Teams: |
Replication
|
||||
| Participants: | |||||
| Description |
|
Consider a 3 nodes replica set. B is primary and A and C are syncing from B. A is at optime (1,1), B is at optime (5,1), C is at optime (3,1). If there is a network partition between B and C, C will run for election, A will vote for it, C will win its election, and it will write (4,2). Eventually B will time out and step down. A does not stop syncing from B when B steps down, and will replicate all the way to (5,1). When the partition lifts, or A catches up, they will both start syncing from C and go into rollback. B is no longer a primary when A catches up, and A propagates C's term, so there’s no danger of committing writes that will roll back. This is probably not ideal though since A did not need to roll back, and A should probably switch sync sources much sooner. |
| Comments |
| Comment by Alex [ 20/Nov/18 ] |
|
@Judah Thank you for taking the time to clarify this. Everything is clear now. |
| Comment by Judah Schvimer [ 13/Nov/18 ] |
|
Hi alex101, Thank you for your question. This gets at the distinction between replicating to a majority of servers and acknowledging a write/serving a read as being "majority committed". We guarantee that if we respond "success" to a "w:majority" write or return data in a "readConcern: majority" read, that the data will not be rolled back. We do not, however, guarantee that if data arrives at a majority of nodes that it will not roll back. This is related to "Figure 8" in the raft paper (though slightly different in this exact case). In the above case, although a majority of nodes have replicated (4,1) and (5,1), we do not acknowledge them or return them in majority reads until the replication majority commit point advances to (4,1) or (5,1). The majority commit point is only advanced on primaries, and as soon as node A votes for node C, node A will only transmit progress updates with its term, causing B to step down before majority committing any new writes. Does this answer the question? |
| Comment by Alex [ 13/Nov/18 ] |
|
I'm wondering how a rollback happens even though (4,1) and (5,1) were written to a majority of nodes (A, B). Does not this lead to data loss (a user might read from a majority before the rollback happens)? |
| Comment by Judah Schvimer [ 19/Dec/17 ] |
|
I agree. I'm not sure what (if anything) we want to do here, however this behavior leads to unnecessary rollbacks, which is generally not ideal (especially since rollback may start to take longer in 3.8), so I think it's worth brainstorming improvements here. |
| Comment by Eric Milkie [ 19/Dec/17 ] |
|
Is this work only for optimizing the case where there is a partial network partition? I can see this change making some situations less optimal as well, as it isn't free to re-evaluate sync sources. I think we should have a discussion on whether the effort to do this is worth it. |
| Comment by Spencer Brody (Inactive) [ 18/Dec/17 ] |
|
We could probably do this by clearing our sync source if our oplog tailing query gets killed with "InterruptedDueToReplStateChange" |