Core Server / SERVER-61519

Term without primary should not last

    • Type: Bug
    • Resolution: Won't Do
    • Priority: Major - P3
    • Affects Version/s: 5.1.0, 5.0.4
    • Component/s: None
    • Labels: None
    • Replication
    • ALL
    • 15

      The problem reproduced in the test is that, while every node behaved properly, the replica set found itself in a state in which no primary will ever be elected. In particular:

      The replicas are randomly restarted and randomly stepped up while the compatibility version is also randomly changed. The race happened when n0 initiated an election but:

      1. n2 was just killed and thus did not participate in the election
      2. n1 stepped down because it received an election request from n0
      3. n0 ignored the vote from n1 because n1's config was older than n0's
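      A minimal sketch of the race (the node names are from the ticket; the config versions, data shapes, and majority rule are illustrative assumptions, not the actual test or server code):

```python
# Hypothetical reconstruction of the three-way race: n2 is down, n1 steps
# down on the vote request, and n0 discards n1's vote because n1's config
# version lags behind n0's. The election therefore cannot reach a majority.

def run_election(candidate, voters, config_versions):
    """Candidate requests votes; returns True on a majority of the 3-node set."""
    votes = 1  # the candidate votes for itself
    for node in voters:
        if not node["alive"]:
            continue                       # n2: killed, never answers
        node["state"] = "SECONDARY"        # n1: steps down on the vote request
        if config_versions[node["name"]] < config_versions[candidate]:
            continue                       # n0 ignores the stale-config vote
        votes += 1
    return votes >= 2                      # majority of a 3-node set

config_versions = {"n0": 5, "n1": 4, "n2": 5}   # n1's config lags behind n0's
n1 = {"name": "n1", "alive": True, "state": "PRIMARY"}
n2 = {"name": "n2", "alive": False, "state": "DOWN"}

won = run_election("n0", [n1, n2], config_versions)
print(won)   # False: the term advanced, but nobody became primary
```

      The point of the sketch is that every node followed its local rules, yet the set as a whole ended the term with no primary.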

      There are two ways to deal with it. The first would be to address this particular race ad hoc, e.g. make n1 track that n0's election actually failed. However, this would be strange, because it is n0's business to track its own election. It might be better to make n0 run again, assuming n1's config will eventually catch up.

      I'm thinking that a preferable solution would be to treat this as a gap in our Raft implementation: make a node watch whether, for the current term, no node has thought itself primary for a certain amount of time. It might be OK to receive '"primaryId": -1' sometimes while having no info on the current primary itself. But eventually, if no node at all declares that it knows the primary for the current term, this should trigger a new election.
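      A sketch of that watchdog idea (class and field names are illustrative assumptions, not MongoDB internals; the timeout value is arbitrary):

```python
import time

# Hypothetical "no primary for this term" watchdog: track the last time any
# heartbeat named a primary for the current term, and call for an election
# once that has not happened for NO_PRIMARY_TIMEOUT seconds.
NO_PRIMARY_TIMEOUT = 10.0

class NoPrimaryWatchdog:
    def __init__(self, term, now=time.monotonic):
        self.now = now                     # injectable clock for testing
        self.term = term
        self.last_primary_seen = self.now()

    def on_heartbeat(self, term, primary_id):
        if term > self.term:               # new term: restart the clock
            self.term = term
            self.last_primary_seen = self.now()
        elif term == self.term and primary_id != -1:
            self.last_primary_seen = self.now()   # someone knows a primary

    def should_call_election(self):
        return self.now() - self.last_primary_seen > NO_PRIMARY_TIMEOUT

# Usage with a fake clock: term 7 has had no known primary for 11 seconds.
clock = [0.0]
wd = NoPrimaryWatchdog(term=7, now=lambda: clock[0])
clock[0] = 11.0
wd.on_heartbeat(term=7, primary_id=-1)    # a peer also sees no primary
print(wd.should_call_election())          # True: time to call an election
```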

      The complexity: what if the node is cut off from the others by a network failure and cannot learn of a new primary? It might be sufficient to require that it has received at least one heartbeat, within this timeout, from a node that is also not aware of a primary. If we run a 3-replica RS, we are done. If we are running a 5 (or 5+)-replica set, it means there might be a disjoint cluster somewhere with consensus, and this pair of disconnected voters won't get enough votes anyway.
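      The safeguard can be sketched on top of the same watchdog idea (again an illustrative sketch, not server code): an election is called only if, within the timeout window, we also heard from at least one peer that likewise reports no primary. A fully partitioned node hears nothing at all, so it stays quiet.

```python
import time

# Hypothetical guarded watchdog: "no primary seen for a while" alone is not
# enough; we also need a recent heartbeat from a peer reporting primaryId -1.
class GuardedWatchdog:
    def __init__(self, term, timeout=10.0, now=time.monotonic):
        self.term, self.timeout, self.now = term, timeout, now
        self.last_primary_seen = self.now()
        self.last_confused_peer = None     # last peer heartbeat with primaryId -1

    def on_heartbeat(self, term, primary_id):
        if term != self.term:
            return
        if primary_id == -1:
            self.last_confused_peer = self.now()
        else:
            self.last_primary_seen = self.now()

    def should_call_election(self):
        t = self.now()
        no_primary_long_enough = t - self.last_primary_seen > self.timeout
        peer_agrees = (self.last_confused_peer is not None
                       and t - self.last_confused_peer <= self.timeout)
        return no_primary_long_enough and peer_agrees

# A partitioned node times out but never heard a confused peer: stays quiet.
clock = [0.0]
node = GuardedWatchdog(term=3, now=lambda: clock[0])
clock[0] = 20.0
print(node.should_call_election())        # False: no peer confirmed -1
node.on_heartbeat(term=3, primary_id=-1)  # a reachable peer is also confused
print(node.should_call_election())        # True: two nodes agree, elect
```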

      The vanilla Raft that we respect states that a replica becomes a Candidate when it does not receive a heartbeat from the primary for a timeout. The case behind this error is a primary that is unknown (-1), which it seems we don't handle properly.
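      For contrast, the textbook-Raft behavior referred to above can be sketched as follows (a sketch only; the timeout bounds are the illustrative 150-300 ms from the Raft paper, not MongoDB settings). The key property is that the randomized election timer resets only when a known leader speaks, so any stretch with an unknown primary necessarily ends in candidacy:

```python
import random, time

class Follower:
    def __init__(self, now=time.monotonic):
        self.now = now                     # injectable clock for testing
        self.deadline = self.now() + random.uniform(0.15, 0.30)
        self.state = "Follower"

    def on_leader_heartbeat(self):
        # Reset only when a *known* leader speaks; an unknown primary (-1)
        # never resets the timer, so the timeout must eventually fire.
        self.deadline = self.now() + random.uniform(0.15, 0.30)

    def tick(self):
        if self.now() >= self.deadline:
            self.state = "Candidate"       # start an election for term + 1
        return self.state

# With no leader heartbeats at all, the timer fires and the node runs.
clock = [0.0]
f = Follower(now=lambda: clock[0])
clock[0] = 1.0                             # well past any 150-300 ms deadline
print(f.tick())                            # Candidate
```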

      Question: why does our implementation allow an unknown primary (-1) for a term, as reproduced by the test? My understanding is that the term should not advance before a majority is reached. Is it a peculiarity of our implementation that the term is advanced at election start rather than at election success? If that's by design, I would presume it might be too much refactoring to fix, and we should rather proceed as I discussed.

            Assignee:
            backlog-server-repl [DO NOT USE] Backlog - Replication Team
            Reporter:
            andrew.shuvalov@mongodb.com Andrew Shuvalov (Inactive)
            Votes:
            0
            Watchers:
            5
