Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-28376

ReplSetTest.initiate() should call awaitReplication before requesting node 0 to step up as primary

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • 3.4.4, 3.5.6
    • Affects Version/s: None
    • Component/s: Replication
    • Labels:
      None
    • Fully Compatible
    • ALL
    • Repl 2017-04-17
    • 0

      During ReplSetTest initialization, when elections happen while adding nodes to the replica set, it is possible for node 0 (the intended primary node after initialization) to lag the other nodes before it is asked to step up.

      https://github.com/mongodb/mongo/blob/231e760c744013fe68fe863a7d0315148c69047a/src/mongo/shell/replsettest.js#L771

      replsettest.js
                  this.stepUp(this.nodes[0]);
              }
      
              // Setup authentication if running test with authentication
              if ((jsTestOptions().keyFile) && cmdKey == 'replSetInitiate') {
                  master = this.getPrimary();
                  jsTest.authenticateNodes(this.nodes);
              }
          };
      
          this.stepUp = function(node) {
              this.awaitSecondaryNodes();
              this.awaitNodesAgreeOnPrimary();
              if (this.getPrimary() === node) {
                  return;
              }
      

      This may result in the step up phase of ReplSetTest,initiate() failing to elect node 0 with a "candidate's data is staler than mine" reason.

      [js_test:replsetprio1] 2017-03-15T13:03:02.788+0000 d20010| 2017-03-15T13:03:02.787+0000 I REPL     [conn2] stepping down from primary, because a new term has begun: 2
      [js_test:replsetprio1] 2017-03-15T13:03:02.788+0000 d20010| 2017-03-15T13:03:02.787+0000 I REPL     [replExecDBWorker-2] transition to SECONDARY
      ...
      [js_test:replsetprio1] 2017-03-15T13:03:02.802+0000 d20010| 2017-03-15T13:03:02.799+0000 D REPL_HB  [conn2] Received heartbeat request from mci-osx1010-10.build.10gen.cc:20011, { replSetHeartbeat: "testSet", configVersion: 2, from: "mci-osx1010-10.build.10gen.cc:20011", fromId: 1, term: 2 }
      [js_test:replsetprio1] 2017-03-15T13:03:02.802+0000 d20010| 2017-03-15T13:03:02.799+0000 D REPL_HB  [conn2] Processed heartbeat from mci-osx1010-10.build.10gen.cc:20011 and generated response, { ok: 1.0, state: 2, v: 2, hbmsg: "", set: "testSet", term: 1, durableOpTime: { ts: Timestamp 1489582961000|2, t: 1 }, opTime: { ts: Timestamp 1489582961000|2, t: 1 } }
      
      ...
      [js_test:replsetprio1] 2017-03-15T13:03:03.315+0000 d20010| 2017-03-15T13:03:03.315+0000 D REPL_HB  [ReplicationExecutor] Sending heartbeat (requestId: 52) to mci-osx1010-10.build.10gen.cc:20012, { replSetHeartbeat: "testSet", configVersion: 2, from: "mci-osx1010-10.build.10gen.cc:20010", fromId: 0, term: 1 }
      [js_test:replsetprio1] 2017-03-15T13:03:03.316+0000 d20010| 2017-03-15T13:03:03.315+0000 D REPL_HB  [ReplicationExecutor] Sending heartbeat (requestId: 53) to mci-osx1010-10.build.10gen.cc:20011, { replSetHeartbeat: "testSet", configVersion: 2, from: "mci-osx1010-10.build.10gen.cc:20010", fromId: 0, term: 1 }
      ...
      [js_test:replsetprio1] 2017-03-15T13:03:03.318+0000 d20010| 2017-03-15T13:03:03.316+0000 D REPL_HB  [ReplicationExecutor] Received response to heartbeat (requestId: 52) from mci-osx1010-10.build.10gen.cc:20012, { ok: 1.0, state: 2, v: 2, hbmsg: "", set: "testSet", term: 2, primaryId: 0, durableOpTime: { ts: Timestamp 1489582961000|2, t: 1 }, opTime: { ts: Timestamp 1489582961000|2, t: 1 }, operationTime: Timestamp 1489582982000|1 }
      [js_test:replsetprio1] 2017-03-15T13:03:03.318+0000 d20010| 2017-03-15T13:03:03.316+0000 I REPL     [ReplicationExecutor] Member mci-osx1010-10.build.10gen.cc:20012 is now in state SECONDARY
      [js_test:replsetprio1] 2017-03-15T13:03:03.318+0000 d20010| 2017-03-15T13:03:03.316+0000 D REPL_HB  [ReplicationExecutor] Scheduling heartbeat to mci-osx1010-10.build.10gen.cc:20012 at 2017-03-15T13:03:08.316Z
      [js_test:replsetprio1] 2017-03-15T13:03:03.319+0000 d20010| 2017-03-15T13:03:03.316+0000 D REPL_HB  [ReplicationExecutor] Received response to heartbeat (requestId: 53) from mci-osx1010-10.build.10gen.cc:20011, { ok: 1.0, electionTime: new Date(6397710192368156673), state: 1, v: 2, hbmsg: "", set: "testSet", term: 2, primaryId: 1, durableOpTime: { ts: Timestamp 1489582961000|2, t: 1 }, opTime: { ts: Timestamp 1489582961000|2, t: 1 }, operationTime: Timestamp 1489582982000|1 }
      [js_test:replsetprio1] 2017-03-15T13:03:03.319+0000 d20010| 2017-03-15T13:03:03.316+0000 I REPL     [ReplicationExecutor] Member mci-osx1010-10.build.10gen.cc:20011 is now in state PRIMARY
      [js_test:replsetprio1] 2017-03-15T13:03:03.319+0000 d20010| 2017-03-15T13:03:03.316+0000 D REPL_HB  [ReplicationExecutor] Scheduling heartbeat to mci-osx1010-10.build.10gen.cc:20011 at 2017-03-15T13:03:08.316Z
      ...
      [js_test:replsetprio1] 2017-03-15T13:03:07.781+0000 d20010| 2017-03-15T13:03:07.781+0000 I COMMAND  [conn11] Received replSetStepUp request
      [js_test:replsetprio1] 2017-03-15T13:03:07.781+0000 d20010| 2017-03-15T13:03:07.781+0000 I REPL     [conn11] Starting an election due to step up request
      [js_test:replsetprio1] 2017-03-15T13:03:07.782+0000 d20010| 2017-03-15T13:03:07.781+0000 I REPL     [conn11] conducting a dry run election to see if we could be elected
      [js_test:replsetprio1] 2017-03-15T13:03:07.782+0000 d20010| 2017-03-15T13:03:07.781+0000 I REPL     [ReplicationExecutor] VoteRequester(term 2 dry run) received a no vote from mci-osx1010-10.build.10gen.cc:20011 with reason "candidate's data is staler than mine"; response message: { term: 2, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0, operationTime: Timestamp 1489582984000|1 }
      [js_test:replsetprio1] 2017-03-15T13:03:07.783+0000 d20010| 2017-03-15T13:03:07.782+0000 I REPL     [ReplicationExecutor] VoteRequester(term 2 dry run) received a no vote from mci-osx1010-10.build.10gen.cc:20012 with reason "candidate's data is staler than mine"; response message: { term: 2, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0, operationTime: Timestamp 1489582984000|1 }
      [js_test:replsetprio1] 2017-03-15T13:03:07.783+0000 d20010| 2017-03-15T13:03:07.782+0000 I REPL     [ReplicationExecutor] not running for primary, we received insufficient votes
      [js_test:replsetprio1] 2017-03-15T13:03:07.783+0000 d20010| 2017-03-15T13:03:07.782+0000 I COMMAND  [conn11] replSetStepUp request failed  :: caused by :: CommandFailed: Election failed.
      

            Assignee:
            benety.goh@mongodb.com Benety Goh
            Reporter:
            benety.goh@mongodb.com Benety Goh
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated:
              Resolved: