  1. Core Server
  2. SERVER-22502

Replication Protocol 1 rollbacks are more likely during priority takeover


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Replication
    • Labels:
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL

      Description

      General note: I know the title is too general, but this is the third bug I'm opening this week. We have another one coming for 3.2.1, related to sharding, which we will publish soon. We are thinking of moving away from MongoDB; the reliability of 3.2 is horrible!

      2 bugs in this ticket:

      1. We removed a member using rs.remove(). After that, the removed member (its log is attached as "crash") started a versioning mess and killed itself.

      2. The second time, we got the following behavior: a member elects itself although it should not, and causes a rollback on the other member.
      Our setup: primary, secondary and arbiter.
      Primary: rs.stepDown() for maintenance.
      Secondary takes over.
      When the original primary comes back, it starts syncing. As you can see from the logs, during this time it receives 2 "no" votes because it is still stale. Then it receives only 1 "yes" vote (for some reason, the arbiter is quiet) and is elected prematurely. This causes a rollback on the other node.
      Logs from all 3 nodes are attached (primary, secondary, arb). Please note the following lines:

      2016-02-07T09:38:14.612+0000 I REPL     [ReplicationExecutor] VoteRequester: Got no vote from in.db2m2.mydomain.com:27017 because: candidate's data is staler than mine, resp:{ term: 3, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
      2016-02-07T09:38:14.612+0000 I REPL     [ReplicationExecutor] VoteRequester: Got no vote from in.db2arb.mydomain.com:27017 because: candidate's data is staler than mine, resp:{ term: 3, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
      

      and about 11 seconds later, suddenly:

      2016-02-07T09:38:25.613+0000 I REPL     [ReplicationExecutor] VoteRequester: Got no vote from in.db2m2.mydomain.com:27017 because: candidate's data is staler than mine, resp:{ term: 3, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
      2016-02-07T09:38:25.613+0000 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
      2016-02-07T09:38:25.614+0000 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 4
      2016-02-07T09:38:25.614+0000 I REPL     [ReplicationExecutor] transition to PRIMARY
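
      To compare the two election attempts without reading the raw lines, the excerpts can be parsed with a short script. This is a minimal sketch, not a general mongod log parser: the names `VOTE_RE`, `parse_ts` and `refused_votes` are made up for illustration, and the pattern only matches the "Got no vote from" format quoted in this ticket.

```python
import re
from datetime import datetime

# The three refused-vote lines, copied verbatim from the excerpts in this ticket.
LOG_EXCERPT = """\
2016-02-07T09:38:14.612+0000 I REPL     [ReplicationExecutor] VoteRequester: Got no vote from in.db2m2.mydomain.com:27017 because: candidate's data is staler than mine, resp:{ term: 3, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
2016-02-07T09:38:14.612+0000 I REPL     [ReplicationExecutor] VoteRequester: Got no vote from in.db2arb.mydomain.com:27017 because: candidate's data is staler than mine, resp:{ term: 3, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
2016-02-07T09:38:25.613+0000 I REPL     [ReplicationExecutor] VoteRequester: Got no vote from in.db2m2.mydomain.com:27017 because: candidate's data is staler than mine, resp:{ term: 3, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
"""

# Illustrative pattern: matches only the refused-vote format shown above.
VOTE_RE = re.compile(
    r"^(?P<ts>\S+) I REPL\s+\[ReplicationExecutor\] VoteRequester: "
    r"Got no vote from (?P<host>\S+) because: (?P<reason>[^,]+), resp:"
)


def parse_ts(text):
    """Parse a timestamp such as 2016-02-07T09:38:14.612+0000."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f%z")


def refused_votes(lines):
    """Return (timestamp, host, reason) for every refused vote in lines."""
    found = []
    for line in lines:
        match = VOTE_RE.match(line.strip())
        if match:
            found.append((parse_ts(match["ts"]), match["host"], match["reason"]))
    return found


votes = refused_votes(LOG_EXCERPT.splitlines())
# Seconds between the first refused vote and the last one.
gap = (votes[-1][0] - votes[0][0]).total_seconds()

for ts, host, reason in votes:
    print(ts.isoformat(), host, reason)
print("seconds between first and last refusal:", gap)
```

      Run against the full attached logs, the same function would show which members answered in each round; here it only covers the lines quoted above.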
      

      All members run protocol version 1. They were on version 0 and were upgraded following your docs about a week ago.
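
      For context on why a single "yes" vote was enough: protocol version 1 elections are majority-based, and a candidate counts its own vote. The arithmetic below is a rough sketch under that assumption for the 3-member set described here (primary, secondary, arbiter); the function names are hypothetical and it ignores everything else the server checks, such as terms and optimes.

```python
def votes_needed(voting_members):
    """Smallest strict majority of the voting members."""
    return voting_members // 2 + 1


def wins_election(voting_members, yes_votes_from_others):
    """True if the candidate's own vote plus the granted votes reach a majority."""
    return 1 + yes_votes_from_others >= votes_needed(voting_members)


# Primary, secondary and arbiter all vote: 3 voting members, majority of 2.
print(votes_needed(3))
# Both other members refuse: the candidate only has its own vote.
print(wins_election(3, 0))
# One other member grants its vote: own vote + 1 reaches the majority.
print(wins_election(3, 1))
```

      Under these assumptions, one granted vote is sufficient in a 3-member set, so the question in this ticket is why that vote was granted while the candidate was still stale, not how many votes arrived.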

        Attachments

        1. arb
          3 kB
        2. crash
          19 kB
        3. primary
          25 kB
        4. secondary
          53 kB

              People

              Assignee:
              milkie Eric Milkie
              Reporter:
              yonido Yoni Douek
              Votes:
              0
              Watchers:
              22
