Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-28376

ReplSetTest.initiate() should call awaitReplication before requesting node 0 to step up as primary

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.4.4, 3.5.6
    • Component/s: Replication
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Completed:
    • Sprint:
      Repl 2017-04-17
    • Linked BF Score:
      0

      Description

      During ReplSetTest initialization, when elections happen while adding nodes to the replica set, it is possible for node 0 (the intended primary node after initialization) to lag the other nodes before it is asked to step up.

      https://github.com/mongodb/mongo/blob/231e760c744013fe68fe863a7d0315148c69047a/src/mongo/shell/replsettest.js#L771

      replsettest.js

      771
                  this.stepUp(this.nodes[0]);
      772
              }
      773
       
      774
              // Setup authentication if running test with authentication
      775
              if ((jsTestOptions().keyFile) && cmdKey == 'replSetInitiate') {
      776
                  master = this.getPrimary();
      777
                  jsTest.authenticateNodes(this.nodes);
      778
              }
      779
          };
      780
       
      781
          this.stepUp = function(node) {
      782
              this.awaitSecondaryNodes();
      783
              this.awaitNodesAgreeOnPrimary();
      784
              if (this.getPrimary() === node) {
      785
                  return;
      786
              }
      

      This may result in the step up phase of ReplSetTest,initiate() failing to elect node 0 with a "candidate's data is staler than mine" reason.

      [js_test:replsetprio1] 2017-03-15T13:03:02.788+0000 d20010| 2017-03-15T13:03:02.787+0000 I REPL     [conn2] stepping down from primary, because a new term has begun: 2
      [js_test:replsetprio1] 2017-03-15T13:03:02.788+0000 d20010| 2017-03-15T13:03:02.787+0000 I REPL     [replExecDBWorker-2] transition to SECONDARY
      ...
      [js_test:replsetprio1] 2017-03-15T13:03:02.802+0000 d20010| 2017-03-15T13:03:02.799+0000 D REPL_HB  [conn2] Received heartbeat request from mci-osx1010-10.build.10gen.cc:20011, { replSetHeartbeat: "testSet", configVersion: 2, from: "mci-osx1010-10.build.10gen.cc:20011", fromId: 1, term: 2 }
      [js_test:replsetprio1] 2017-03-15T13:03:02.802+0000 d20010| 2017-03-15T13:03:02.799+0000 D REPL_HB  [conn2] Processed heartbeat from mci-osx1010-10.build.10gen.cc:20011 and generated response, { ok: 1.0, state: 2, v: 2, hbmsg: "", set: "testSet", term: 1, durableOpTime: { ts: Timestamp 1489582961000|2, t: 1 }, opTime: { ts: Timestamp 1489582961000|2, t: 1 } }
       
      ...
      [js_test:replsetprio1] 2017-03-15T13:03:03.315+0000 d20010| 2017-03-15T13:03:03.315+0000 D REPL_HB  [ReplicationExecutor] Sending heartbeat (requestId: 52) to mci-osx1010-10.build.10gen.cc:20012, { replSetHeartbeat: "testSet", configVersion: 2, from: "mci-osx1010-10.build.10gen.cc:20010", fromId: 0, term: 1 }
      [js_test:replsetprio1] 2017-03-15T13:03:03.316+0000 d20010| 2017-03-15T13:03:03.315+0000 D REPL_HB  [ReplicationExecutor] Sending heartbeat (requestId: 53) to mci-osx1010-10.build.10gen.cc:20011, { replSetHeartbeat: "testSet", configVersion: 2, from: "mci-osx1010-10.build.10gen.cc:20010", fromId: 0, term: 1 }
      ...
      [js_test:replsetprio1] 2017-03-15T13:03:03.318+0000 d20010| 2017-03-15T13:03:03.316+0000 D REPL_HB  [ReplicationExecutor] Received response to heartbeat (requestId: 52) from mci-osx1010-10.build.10gen.cc:20012, { ok: 1.0, state: 2, v: 2, hbmsg: "", set: "testSet", term: 2, primaryId: 0, durableOpTime: { ts: Timestamp 1489582961000|2, t: 1 }, opTime: { ts: Timestamp 1489582961000|2, t: 1 }, operationTime: Timestamp 1489582982000|1 }
      [js_test:replsetprio1] 2017-03-15T13:03:03.318+0000 d20010| 2017-03-15T13:03:03.316+0000 I REPL     [ReplicationExecutor] Member mci-osx1010-10.build.10gen.cc:20012 is now in state SECONDARY
      [js_test:replsetprio1] 2017-03-15T13:03:03.318+0000 d20010| 2017-03-15T13:03:03.316+0000 D REPL_HB  [ReplicationExecutor] Scheduling heartbeat to mci-osx1010-10.build.10gen.cc:20012 at 2017-03-15T13:03:08.316Z
      [js_test:replsetprio1] 2017-03-15T13:03:03.319+0000 d20010| 2017-03-15T13:03:03.316+0000 D REPL_HB  [ReplicationExecutor] Received response to heartbeat (requestId: 53) from mci-osx1010-10.build.10gen.cc:20011, { ok: 1.0, electionTime: new Date(6397710192368156673), state: 1, v: 2, hbmsg: "", set: "testSet", term: 2, primaryId: 1, durableOpTime: { ts: Timestamp 1489582961000|2, t: 1 }, opTime: { ts: Timestamp 1489582961000|2, t: 1 }, operationTime: Timestamp 1489582982000|1 }
      [js_test:replsetprio1] 2017-03-15T13:03:03.319+0000 d20010| 2017-03-15T13:03:03.316+0000 I REPL     [ReplicationExecutor] Member mci-osx1010-10.build.10gen.cc:20011 is now in state PRIMARY
      [js_test:replsetprio1] 2017-03-15T13:03:03.319+0000 d20010| 2017-03-15T13:03:03.316+0000 D REPL_HB  [ReplicationExecutor] Scheduling heartbeat to mci-osx1010-10.build.10gen.cc:20011 at 2017-03-15T13:03:08.316Z
      ...
      [js_test:replsetprio1] 2017-03-15T13:03:07.781+0000 d20010| 2017-03-15T13:03:07.781+0000 I COMMAND  [conn11] Received replSetStepUp request
      [js_test:replsetprio1] 2017-03-15T13:03:07.781+0000 d20010| 2017-03-15T13:03:07.781+0000 I REPL     [conn11] Starting an election due to step up request
      [js_test:replsetprio1] 2017-03-15T13:03:07.782+0000 d20010| 2017-03-15T13:03:07.781+0000 I REPL     [conn11] conducting a dry run election to see if we could be elected
      [js_test:replsetprio1] 2017-03-15T13:03:07.782+0000 d20010| 2017-03-15T13:03:07.781+0000 I REPL     [ReplicationExecutor] VoteRequester(term 2 dry run) received a no vote from mci-osx1010-10.build.10gen.cc:20011 with reason "candidate's data is staler than mine"; response message: { term: 2, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0, operationTime: Timestamp 1489582984000|1 }
      [js_test:replsetprio1] 2017-03-15T13:03:07.783+0000 d20010| 2017-03-15T13:03:07.782+0000 I REPL     [ReplicationExecutor] VoteRequester(term 2 dry run) received a no vote from mci-osx1010-10.build.10gen.cc:20012 with reason "candidate's data is staler than mine"; response message: { term: 2, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0, operationTime: Timestamp 1489582984000|1 }
      [js_test:replsetprio1] 2017-03-15T13:03:07.783+0000 d20010| 2017-03-15T13:03:07.782+0000 I REPL     [ReplicationExecutor] not running for primary, we received insufficient votes
      [js_test:replsetprio1] 2017-03-15T13:03:07.783+0000 d20010| 2017-03-15T13:03:07.782+0000 I COMMAND  [conn11] replSetStepUp request failed  :: caused by :: CommandFailed: Election failed.
      

        Attachments

          Issue Links

            Activity

              People

              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: