Core Server / SERVER-33747

Arbiter tries to start data replication if it cannot find itself in config after restart

    • Fully Compatible
    • ALL
    • v4.4, v4.2, v4.0, v3.6, v3.4
    • Repro script:
      /*
       * An arbiter that is stopped and restarted on a different port and rejoins the
       * replica set causes an invariant failure because it tries to do an initial sync. See HELP-6042.
       */
      (function() {
          "use strict";
          var replTest = new ReplSetTest({name: 'test', nodes: 3});
          replTest.startSet();
          var nodes = replTest.nodeList();
          var config = {
              "_id": "test",
              "members": [
                  {"_id": 0, "host": nodes[0]},
                  {"_id": 1, "host": nodes[1]},
                  {"_id": 2, "host": nodes[2], arbiterOnly: true}
              ]
          };
          replTest.initiate(config);
      
          let primary = replTest.getPrimary();
          replTest.awaitReplication();
          replTest.awaitSecondaryNodes();
      
          var arbiterId = 2;
          var newPort = 45678;
      
          jsTestLog("Restarting the arbiter node on a new port: " + newPort);
          replTest.stop(arbiterId);
          replTest.start(arbiterId, {port: newPort}, true);
      
          jsTestLog("Reconfiguring the set to include the arbiter on the new port.");
          var config = primary.getDB("local").system.replset.findOne();
      
          jsTestLog("Original config:");
          jsTestLog(tojson(config));
      
          var hostname = config.members[arbiterId].host.split(":")[0];
          config.version++;
          config.members[arbiterId].host = hostname + ":" + newPort;
      
          jsTestLog("New config:");
          jsTestLog(tojson(config));
      
          assert.commandWorked(primary.getDB("admin").runCommand({replSetReconfig: config}));
          replTest.awaitReplication();
          replTest.awaitNodesAgreeOnConfigVersion();
      
          replTest.stopSet();
      }());
      
    • Repl 2020-08-10, Repl 2020-09-07, Repl 2020-11-02, Repl 2020-11-16
    • 50

      Consider the following scenario.

      1. Start a three-node replica set with one arbiter node. Assume the nodes have the following hostnames:
        • localhost:10000 (primary)
        • localhost:10001 (secondary)
        • localhost:10002 (arbiter)
      2. Shut down the arbiter node and restart it as part of the same replica set, but on a different port, say 20000. Its hostname is now localhost:20000.
      3. When the arbiter starts up, it will try to load its previously persisted replica set config, which still lists the original hostnames above. In ReplicationCoordinatorImpl::_finishLoadLocalConfig it will call validateConfigForStartUp and try to find itself in the config by calling findSelfInConfig in repl_set_config_checks.cpp.
      4. Since its hostname now differs from the one in the original config, it will fail to find itself, so _finishLoadLocalConfig will report its index as -1.
      5. We will then check whether this node is an arbiter, in order to avoid starting data replication. However, if we don't find ourselves in the config, we never consider the node an arbiter, so we will try to start data replication.
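
The startup path in steps 3 to 5 can be modeled with a small standalone sketch. It is plain JavaScript rather than the server's C++, and it only illustrates the control flow described above. The names findSelfIndex and shouldStartDataReplication are hypothetical stand-ins for findSelfInConfig and the arbiter check in _finishLoadLocalConfig, and the sketch assumes a node identifies itself by an exact host:port match.

```javascript
// Simplified model of the startup check described above (not the server's actual code).
// Assumption: a node identifies itself by an exact "host:port" string match.
function findSelfIndex(config, myHostAndPort) {
    // Array.prototype.findIndex returns -1 when no member matches.
    return config.members.findIndex((m) => m.host === myHostAndPort);
}

// Mirrors the flawed logic: a node counts as an arbiter only if it is found in the config.
function shouldStartDataReplication(config, myHostAndPort) {
    const selfIndex = findSelfIndex(config, myHostAndPort);
    const isArbiter = selfIndex >= 0 && config.members[selfIndex].arbiterOnly === true;
    return !isArbiter;
}

// The config persisted before the arbiter was moved to a new port.
const persistedConfig = {
    _id: "test",
    members: [
        {_id: 0, host: "localhost:10000"},
        {_id: 1, host: "localhost:10001"},
        {_id: 2, host: "localhost:10002", arbiterOnly: true},
    ],
};

// Same port as in the persisted config: recognized as an arbiter, no data replication.
const beforeMove = shouldStartDataReplication(persistedConfig, "localhost:10002");
// Restarted on port 20000: index is -1, the arbiter check is skipped, replication starts.
const afterMove = shouldStartDataReplication(persistedConfig, "localhost:20000");

console.log(beforeMove, afterMove);  // prints: false true
```

In this model the second call returns true even though the node is an arbiter, which is the situation the ticket describes.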

      The fundamental issue is that we should not be starting data replication if we are an arbiter. See the attached repro script help_6042.js for an example of how this can manifest as an invariant failure. In short, the arbiter is able to attempt an initial sync, and when that initial sync finishes we crash because the node is not in the expected STARTUP2 state.

      To fix this, one approach may be to never start data replication if we can't find ourselves in the local replica set config. In that case we should enter the REMOVED state, and we shouldn't need to start replicating until we become a proper member of the replica set. We could perhaps rely on ReplicationCoordinatorImpl::_heartbeatReconfigStore to start data replication whenever we receive a heartbeat that brings us back into the config as a valid node. It already has a check for whether or not we should start data replication.
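
The proposed behavior can be sketched the same way. This is a minimal standalone model, not a patch: NodeModel, installConfig, and selfIndexIn are hypothetical names, self-identification is again assumed to be an exact host:port match, and moving a data-bearing node straight to STARTUP2 is a simplification. The point it illustrates is that a node missing from its config goes to REMOVED without replicating, and the decision is made again when a later config (for example one delivered by heartbeat) includes it.

```javascript
// Sketch of the proposed behavior; all names here are hypothetical, not server code.
// Assumption: a node identifies itself by an exact "host:port" string match.
function selfIndexIn(config, self) {
    return config.members.findIndex((m) => m.host === self);
}

class NodeModel {
    constructor(self) {
        this.self = self;
        this.state = "STARTUP";
        this.dataReplicationStarted = false;
    }

    // Used both for the config loaded at startup and for one installed via heartbeat.
    installConfig(config) {
        const idx = selfIndexIn(config, this.self);
        if (idx < 0) {
            // Not in the config: enter REMOVED and do not start data replication.
            this.state = "REMOVED";
            return;
        }
        if (config.members[idx].arbiterOnly === true) {
            // Arbiters never start data replication.
            this.state = "ARBITER";
            return;
        }
        // Simplification: a valid data-bearing member begins replicating right away.
        this.state = "STARTUP2";
        this.dataReplicationStarted = true;
    }
}

const staleConfig = {
    version: 1,
    members: [
        {_id: 0, host: "localhost:10000"},
        {_id: 1, host: "localhost:10001"},
        {_id: 2, host: "localhost:10002", arbiterOnly: true},
    ],
};
const reconfigured = {
    version: 2,
    members: [
        {_id: 0, host: "localhost:10000"},
        {_id: 1, host: "localhost:10001"},
        {_id: 2, host: "localhost:20000", arbiterOnly: true},
    ],
};

// The arbiter restarts on port 20000 with the stale config on disk.
const arbiter = new NodeModel("localhost:20000");
arbiter.installConfig(staleConfig);
const stateAfterStartup = arbiter.state;
// A heartbeat later delivers the reconfigured set that lists the new port.
arbiter.installConfig(reconfigured);
const stateAfterHeartbeat = arbiter.state;

console.log(stateAfterStartup, stateAfterHeartbeat, arbiter.dataReplicationStarted);
// prints: REMOVED ARBITER false
```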

            Assignee:
            jesse@mongodb.com A. Jesse Jiryu Davis
            Reporter:
            william.schultz@mongodb.com William Schultz (Inactive)
            Votes:
            0
            Watchers:
            18
