Core Server / SERVER-22502

Replication Protocol 1 rollbacks are more likely during priority takeover


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Replication
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL

    Description

      General note: I know the title is too general, but this is the 3rd bug I'm opening this week. We have another one coming for 3.2.1, related to sharding, which we will publish soon. We are thinking of moving off MongoDB; the reliability of 3.2 is horrible!

      2 bugs in this ticket:

      1. We removed a member using rs.remove(). After that, the removed member (whose log is attached) started a versioning mess and killed itself.
      (attachment filename = crash)

      2. The 2nd time, we saw the following behavior: a member elects itself, although it doesn't need to, and causes a rollback on the other member.
      Our setup: primary, secondary, and arbiter.
      Primary: rs.stepDown() for maintenance.
      Secondary takes over.
      When the old primary is back, it starts syncing. As you can see from the logs, during this time it receives 2 "no" votes since it is still stale, but then it receives only 1 "yes" vote (for some reason, the arbiter is quiet) and is elected before its time. This causes a rollback on the other node.
      All 3 nodes' logs are attached (primary, secondary, arb). Please note the following lines:

      2016-02-07T09:38:14.612+0000 I REPL     [ReplicationExecutor] VoteRequester: Got no vote from in.db2m2.mydomain.com:27017 because: candidate's data is staler than mine, resp:{ term: 3, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
      2016-02-07T09:38:14.612+0000 I REPL     [ReplicationExecutor] VoteRequester: Got no vote from in.db2arb.mydomain.com:27017 because: candidate's data is staler than mine, resp:{ term: 3, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
      

      and after 9 seconds, suddenly:

      2016-02-07T09:38:25.613+0000 I REPL     [ReplicationExecutor] VoteRequester: Got no vote from in.db2m2.mydomain.com:27017 because: candidate's data is staler than mine, resp:{ term: 3, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
      2016-02-07T09:38:25.613+0000 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
      2016-02-07T09:38:25.614+0000 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 4
      2016-02-07T09:38:25.614+0000 I REPL     [ReplicationExecutor] transition to PRIMARY
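The "dry election run succeeded, running for election" line above reflects a two-phase, raft-style election: the candidate first asks for votes without bumping the term, and only starts a real election (in the next term) if the dry run would win. A minimal sketch of that sequence, with illustrative names that are not MongoDB internals:

```python
# Hedged sketch of a raft-style two-phase election, matching the log
# sequence above (dry run in term 3, then "assuming primary role in term 4").
# `request_votes` and its signature are illustrative assumptions.
def run_for_primary(request_votes, current_term: int):
    # Phase 1: dry run -- solicit votes WITHOUT incrementing the term,
    # so a doomed candidacy cannot disturb an established primary.
    if not request_votes(term=current_term, dry_run=True):
        return current_term, False
    # Phase 2: real election in the next term.
    new_term = current_term + 1
    won = request_votes(term=new_term, dry_run=False)
    return new_term, won

# Example: if both rounds succeed starting from term 3, the node
# becomes primary in term 4, as in the logs above.
term, won = run_for_primary(lambda term, dry_run: True, 3)
assert (term, won) == (4, True)
```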
      

      All members are on protocol version 1. They were on version 0 but were upgraded according to your docs ~a week ago.
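On the "only 1 yes vote" point: in a 3-member set (primary, secondary, arbiter) the election majority is 2, and a candidate votes for itself, so a single "yes" from any other member is arithmetically enough to win even if the arbiter stays silent. A quick sketch of that quorum math (not MongoDB source, just the arithmetic):

```python
# Hedged sketch: majority quorum arithmetic for a 3-voting-member
# replica set (primary, secondary, arbiter). Function names are illustrative.
def majority(n_voting_members: int) -> int:
    """Votes needed to win an election."""
    return n_voting_members // 2 + 1

def election_wins(yes_votes_from_others: int, n_voting_members: int = 3) -> bool:
    # A candidate always counts its own vote.
    return 1 + yes_votes_from_others >= majority(n_voting_members)

# With 3 voting members, majority is 2: one external "yes" plus the
# self-vote wins, so the arbiter's silence does not block the election.
assert majority(3) == 2
assert election_wins(yes_votes_from_others=1)      # elected
assert not election_wins(yes_votes_from_others=0)  # self-vote alone fails
```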

      Attachments

        1. arb (3 kB, Yoni Douek)
        2. crash (19 kB, Yoni Douek)
        3. primary (25 kB, Yoni Douek)
        4. secondary (53 kB, Yoni Douek)

    Issue Links

    Activity

    People

        Assignee: milkie@mongodb.com Eric Milkie
        Reporter: yonido Yoni Douek
        Votes: 0
        Watchers: 22

    Dates

        Created:
        Updated:
        Resolved: