[SERVER-39626] Majority committed oplog entries may be rolled back on minority nodes Created: 15/Feb/19  Updated: 06/Dec/22  Resolved: 06/Jan/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Siyuan Zhou Assignee: Backlog - Replication Team
Resolution: Won't Fix Votes: 0
Labels: RF
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive SERVER-39626.zip    
Issue Links:
Backports
Related
related to SERVER-39367 lastOpCommitted being reset on restar... Closed
related to SERVER-39831 Never update commit point beyond last... Closed
Assigned Teams:
Replication
Operating System: ALL
Backport Requested:
v3.4
Participants:

 Description   

Imagine we have the following oplog scenario of 5 nodes (A-E), where Node E is the primary in term 1:

A: [1] [1]
B: [1]
C: [1]
D: [1]
E: [1] [1] [1]

Then, Node B steps up in term 2 with the votes from B, C and D.

A: [1] [1]
B: [1] [2]
C: [1]
D: [1]
E: [1] [1] [1]

Node E steps up again in term 3 with votes from C, D and E. Node C and D caught up the oplog.

A: [1] [1]
B: [1] [2]
C: [1] [1] [1] [3]
D: [1] [1] [1] [3]
E: [1] [1] [1] [3]

Now, Node A learns of the latest commit point from E, the entry in term 3, and it updates its commit point to the second oplog entry in term 1, but it changes the sync source to B very soon. Node A will have to roll back the second oplog entry in term 1, which is already majority committed. This case is impossible in Raft, because if B doesn't know of term 3, Node A will reject AppendEntries from B since Node A learns of term 3 when updating its commit point; if B knows of term 3, it will step down and not send AppendEntries any more.

Even though this scenario doesn't affect the correctness of "majority committed" - All future primaries will have all "majority committed" oplog entries, but it may cause Node B to hit this invariant in rollback as pointed out by judah.schvimer, since this minority node is trying to roll back a "majority committed" oplog entry. In this worst case, the secondary crashes, but we haven't seen this in the field or in testing.

One plausible solution is to advance the commit point only if the commit point is in the same term as the node's last applied OpTime, no matter where the commit point is from spanning tree or heartbeats, since the commit points are always "immediately committed", rather than "prefix committed" according to the definition in the Raft formal spec. "Immediately committed" oplog entries will never be rolled back.



 Comments   
Comment by A. Jesse Jiryu Davis [ 14/Jan/20 ]

I've attached some TLA+ files to reproduce this issue, SERVER-39626.zip.

Comment by Tess Avitabile (Inactive) [ 13/Jan/20 ]

There was an omission in the description. When Node A learns of the latest commit point from E, the entry in term 3, Node A actually updates its commit point to the second oplog entry in term 1. When a node learns a commit point from its sync source, it updates its commit point to the minimum of the sync source's commit point and its own lastApplied, due to SERVER-39831. Node A then rolls back the second oplog entry in term 1. I will update the ticket description.

Comment by A. Jesse Jiryu Davis [ 10/Jan/20 ]

In the ticket description it says "Node A learns of the latest commit point from E, the entry in term 3", but that's prohibited. Node A won't update its copy of the commit point to (term 3, index 4), since that's beyond Node A's lastApplied and its term doesn't equal Node A's term, which is 1.

I think it's possible for Node A to learn that (term 1, index 2) is committed, however. Starting at the step, "Node E steps up again in term 3 with votes from C, D and E. Node C and D caught up the oplog", then Node E can advance its commit point to (term 1, index 2), and Node A can learn the commit point from E via heartbeat. Then, as in the original description, Node A can roll back (term 1, index 2) against Node B and hit the invariant.

Comment by Tess Avitabile (Inactive) [ 06/Jan/20 ]

Closing as Won't Fix. We can reopen this issue if it's reported in the field.

Comment by Siyuan Zhou [ 25/Feb/19 ]

It's interesting to note that the window of the "committed" oplog entries that can be rolled back is small.

"Committed" entries consists of "immediately committed" and "prefix committed" entries. If an oplog entry with term T is replicated on a majority of nodes while these nodes are all in term T, then the entry is called "immediately committed". Otherwise, an entry can be considered "prefix committed" if an entry later than it on the same branch is "immediately committed". As an example, the third entry in term 1 in the above example is only "prefix committed"; however the first entry in term 1 and the entry in term 3 are "immediately committed". The terminologies are borrowed from the TLA+ spec in Raft's dissertation.

Immediately committed will never roll back on any node. The problem in this ticket only applies to the entries that are "prefix committed", so the window is pretty small. This might explain why we've never seen this in testing or in the field.

Generated at Thu Feb 08 04:52:36 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.