  1. Core Server
  2. SERVER-39626

Majority committed oplog entries may be rolled back on minority nodes

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Replication
    • Labels:
    • Replication
    • ALL
    • v3.4

      Imagine we have the following oplog scenario of 5 nodes (A-E), where Node E is the primary in term 1:

      A: [1] [1]
      B: [1]
      C: [1]
      D: [1]
      E: [1] [1] [1]
      

      Then, Node B steps up in term 2 with the votes from B, C and D.

      A: [1] [1]
      B: [1] [2]
      C: [1]
      D: [1]
      E: [1] [1] [1]
      

      Node E steps up again in term 3 with votes from C, D and E. Nodes C and D catch up on E's oplog.

      A: [1] [1]
      B: [1] [2]
      C: [1] [1] [1] [3]
      D: [1] [1] [1] [3]
      E: [1] [1] [1] [3]
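
As a rough check of the state above, the following sketch counts how many nodes hold each of E's entries. It models an oplog entry as a hypothetical `(term, index)` tuple; the names `oplogs`, `replica_count` and `majority_replicated` are illustrative only and do not come from the server code.

```python
# Oplogs after E steps up in term 3; each entry is (term, index).
oplogs = {
    "A": [(1, 1), (1, 2)],
    "B": [(1, 1), (2, 2)],
    "C": [(1, 1), (1, 2), (1, 3), (3, 4)],
    "D": [(1, 1), (1, 2), (1, 3), (3, 4)],
    "E": [(1, 1), (1, 2), (1, 3), (3, 4)],
}


def replica_count(entry, logs):
    """Number of nodes whose oplog contains the given entry."""
    return sum(entry in log for log in logs.values())


def majority_replicated(primary_log, logs):
    """Entries of the primary's oplog that are present on a majority of nodes."""
    majority = len(logs) // 2 + 1
    return [e for e in primary_log if replica_count(e, logs) >= majority]


on_majority = majority_replicated(oplogs["E"], oplogs)
print(on_majority)
print(replica_count((2, 2), oplogs))
```

Under this model every entry on E, including the term 3 entry, is on at least three of the five nodes, while B's term 2 entry exists only on B.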
      

      Now, Node A learns of the latest commit point from E, which is the entry in term 3, and advances its own commit point to the second oplog entry in term 1, the newest entry it has. Shortly afterwards, it changes its sync source to B. Node A then has to roll back the second oplog entry in term 1, which is already majority committed. This case is impossible in Raft: if B doesn't know of term 3, Node A rejects AppendEntries from B, because Node A learned of term 3 when it updated its commit point; if B does know of term 3, it steps down and no longer sends AppendEntries.
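
The rollback described above can be traced with a small sketch. It is a simplified model, not the server's rollback code: entries are hypothetical `(term, index)` tuples compared as plain tuples, the commit point on A is capped at A's last applied entry, and the helper names are made up for illustration.

```python
def common_point(local, source):
    """Last entry shared by both oplogs, or None if they share nothing."""
    last = None
    for mine, theirs in zip(local, source):
        if mine != theirs:
            break
        last = mine
    return last


def entries_to_roll_back(local, source):
    """Local entries after the common point; these diverge from the sync source."""
    cp = common_point(local, source)
    keep = 0 if cp is None else local.index(cp) + 1
    return local[keep:]


node_a = [(1, 1), (1, 2)]
node_b = [(1, 1), (2, 2)]
commit_point_from_e = (3, 4)  # the term 3 entry on E

# A cannot commit past what it has applied, so it caps the learned commit point.
a_commit_point = min(commit_point_from_e, node_a[-1])

rolled_back = entries_to_roll_back(node_a, node_b)
# Entries at or before A's commit point that would nevertheless be rolled back.
violations = [e for e in rolled_back if e <= a_commit_point]
print(a_commit_point, rolled_back, violations)
```

In this model A's commit point lands on the second term 1 entry, and syncing from B forces A to roll back exactly that entry.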

      This scenario doesn't affect the correctness of "majority committed": all future primaries will still have every "majority committed" oplog entry. However, it may cause Node A to hit this invariant in rollback, as pointed out by judah.schvimer, since this minority node is trying to roll back a "majority committed" oplog entry. In the worst case, the secondary crashes, but we haven't seen this in the field or in testing.

      One plausible solution is to advance the commit point only if it is in the same term as the node's last applied OpTime, regardless of whether the commit point was learned through the spanning tree or through heartbeats. Commit points advanced this way are always "immediately committed" rather than "prefix committed", according to the definitions in the Raft formal spec, and "immediately committed" oplog entries will never be rolled back.
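
A minimal sketch of the proposed rule, next to a simplified version of the behavior described in this ticket, might look like the following. OpTimes are modeled as hypothetical `(term, index)` tuples, both function names are invented for illustration, and A's starting commit point of `(1, 1)` is an assumption.

```python
def advance_capped(current, learned, last_applied):
    """Simplified current behavior: cap the learned commit point at last applied."""
    return max(current, min(learned, last_applied))


def advance_same_term_only(current, learned, last_applied):
    """Proposed rule: only advance when the learned commit point is in the
    same term as the node's last applied OpTime."""
    if learned[0] != last_applied[0]:
        return current  # terms differ, so leave the commit point unchanged
    return max(current, min(learned, last_applied))


learned_from_e = (3, 4)

# Node A: last applied is still in term 1 (assumed starting commit point (1, 1)).
a_capped = advance_capped((1, 1), learned_from_e, (1, 2))
a_proposed = advance_same_term_only((1, 1), learned_from_e, (1, 2))

# Node C: last applied is the term 3 entry, so the commit point may advance.
c_proposed = advance_same_term_only((1, 1), learned_from_e, (3, 4))
print(a_capped, a_proposed, c_proposed)
```

With the same-term check, Node A leaves its commit point alone until it has applied an entry from term 3, so a later rollback against B never touches an entry that A itself considers committed.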

            Assignee:
            backlog-server-repl [DO NOT USE] Backlog - Replication Team
            Reporter:
            siyuan.zhou@mongodb.com Siyuan Zhou
            Votes:
            0
            Watchers:
            17
