Prevent a majority of nodes in a replica set from going into rollback

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Replication

      There's an interleaving between an election and oplog fetching that makes it possible for a majority of nodes in a replica set to go into rollback. Here's an example:

      1. Node 0 is primary and is at (timestamp: 10, term: 1).
      2. Node 2 decides to run for election; its last applied optime is (timestamp: 10, term: 1).
      3. Node 1 receives a vote request from Node 2 and votes yes because its last applied optime is also (timestamp: 10, term: 1).
      4. Node 2 wins the election and starts primary catchup. Based on heartbeats, its target optime is (timestamp: 10, term: 1). It writes a new-term oplog entry at (timestamp: 11, term: 2).
      5. Node 0 accepts a write at (timestamp: 12, term: 1).
      6. Node 1 replicates the write at (timestamp: 12, term: 1) because it hasn't changed its sync source yet.
      7. Before Node 0 hears back from Node 1, Node 0 steps down and tries to sync from Node 2, but realizes it needs to roll back (timestamp: 12, term: 1).
      8. Node 1 syncs from Node 2 and realizes it also needs to roll back (timestamp: 12, term: 1).
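
      The end state of this interleaving can be sketched in a few lines of Python. This is only an illustrative model, not server code: the `OpTime` tuple, the `needs_rollback` helper, and the node names are hypothetical, and each oplog is cut down to the entries mentioned in the example. It takes the state after step 6, with Node 2 as the new primary, and checks which of the other nodes have diverged from it.

```python
from typing import List, NamedTuple


class OpTime(NamedTuple):
    """Hypothetical stand-in for the position of an oplog entry."""
    timestamp: int
    term: int


def needs_rollback(local: List[OpTime], source: List[OpTime]) -> bool:
    """Return True if `local` has diverged from `source`.

    Diverged means both logs hold entries past their last common point.
    A log that is merely behind (a prefix of the other) has not diverged.
    """
    common = 0
    for mine, theirs in zip(local, source):
        if mine != theirs:
            break
        common += 1
    return common < len(local) and common < len(source)


# State after step 6 of the example, truncated to the relevant entries.
oplogs = {
    "node0": [OpTime(10, 1), OpTime(12, 1)],  # old primary, accepted a late write
    "node1": [OpTime(10, 1), OpTime(12, 1)],  # replicated that write from node 0
    "node2": [OpTime(10, 1), OpTime(11, 2)],  # new primary, wrote the term-2 entry
}

new_primary = "node2"
rolling_back = sorted(
    name
    for name, log in oplogs.items()
    if name != new_primary and needs_rollback(log, oplogs[new_primary])
)
majority = len(oplogs) // 2 + 1

print(rolling_back)                   # nodes that must roll back
print(len(rolling_back) >= majority)  # whether they form a majority
```

      In this model both Node 0 and Node 1 have diverged from the new primary, so two of the three nodes are in rollback at the same time.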

      Rollback can be a very slow operation that can take tens of minutes. In this situation, the multiple rollbacks cause write unavailability until at least one of the nodes returns to the secondary state.

      We should be careful to make sure that any solutions in this space do not make other, more common, operations much worse.

            Assignee: Unassigned
            Reporter: Samyukta Lanka
            Votes: 1
            Watchers: 8