[SERVER-39626] Majority committed oplog entries may be rolled back on minority nodes Created: 15/Feb/19 Updated: 06/Dec/22 Resolved: 06/Jan/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Siyuan Zhou | Assignee: | Backlog - Replication Team |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | RF | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Assigned Teams: |
Replication
|
||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Backport Requested: |
v3.4
|
||||||||||||||||
| Participants: | |||||||||||||||||
| Description |
|
Imagine we have the following oplog scenario of 5 nodes (A-E), where Node E is the primary in term 1:
Then, Node B steps up in term 2 with the votes from B, C and D.
Node E steps up again in term 3 with votes from C, D and E. Node C and D caught up the oplog.
Now, Node A learns of the latest commit point from E, the entry in term 3, and it updates its commit point to the second oplog entry in term 1, but it changes the sync source to B very soon. Node A will have to roll back the second oplog entry in term 1, which is already majority committed. This case is impossible in Raft, because if B doesn't know of term 3, Node A will reject AppendEntries from B since Node A learns of term 3 when updating its commit point; if B knows of term 3, it will step down and not send AppendEntries any more. Even though this scenario doesn't affect the correctness of "majority committed" - All future primaries will have all "majority committed" oplog entries, but it may cause Node B to hit this invariant in rollback as pointed out by judah.schvimer, since this minority node is trying to roll back a "majority committed" oplog entry. In this worst case, the secondary crashes, but we haven't seen this in the field or in testing. One plausible solution is to advance the commit point only if the commit point is in the same term as the node's last applied OpTime, no matter where the commit point is from spanning tree or heartbeats, since the commit points are always "immediately committed", rather than "prefix committed" according to the definition in the Raft formal spec. "Immediately committed" oplog entries will never be rolled back. |
| Comments |
| Comment by A. Jesse Jiryu Davis [ 14/Jan/20 ] |
|
I've attached some TLA+ files to reproduce this issue, |
| Comment by Tess Avitabile (Inactive) [ 13/Jan/20 ] |
|
There was an omission in the description. When Node A learns of the latest commit point from E, the entry in term 3, Node A actually updates its commit point to the second oplog entry in term 1. When a node learns a commit point from its sync source, it updates its commit point to the minimum of the sync source's commit point and its own lastApplied, due to |
| Comment by A. Jesse Jiryu Davis [ 10/Jan/20 ] |
|
In the ticket description it says "Node A learns of the latest commit point from E, the entry in term 3", but that's prohibited. Node A won't update its copy of the commit point to (term 3, index 4), since that's beyond Node A's lastApplied and its term doesn't equal Node A's term, which is 1. I think it's possible for Node A to learn that (term 1, index 2) is committed, however. Starting at the step, "Node E steps up again in term 3 with votes from C, D and E. Node C and D caught up the oplog", then Node E can advance its commit point to (term 1, index 2), and Node A can learn the commit point from E via heartbeat. Then, as in the original description, Node A can roll back (term 1, index 2) against Node B and hit the invariant. |
| Comment by Tess Avitabile (Inactive) [ 06/Jan/20 ] |
|
Closing as Won't Fix. We can reopen this issue if it's reported in the field. |
| Comment by Siyuan Zhou [ 25/Feb/19 ] |
|
It's interesting to note that the window of the "committed" oplog entries that can be rolled back is small. "Committed" entries consists of "immediately committed" and "prefix committed" entries. If an oplog entry with term T is replicated on a majority of nodes while these nodes are all in term T, then the entry is called "immediately committed". Otherwise, an entry can be considered "prefix committed" if an entry later than it on the same branch is "immediately committed". As an example, the third entry in term 1 in the above example is only "prefix committed"; however the first entry in term 1 and the entry in term 3 are "immediately committed". The terminologies are borrowed from the TLA+ spec in Raft's dissertation. Immediately committed will never roll back on any node. The problem in this ticket only applies to the entries that are "prefix committed", so the window is pretty small. This might explain why we've never seen this in testing or in the field. |