Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.2.11, 4.0.22, 3.6.22, 4.4.3, 5.0.0-rc0
Affects Version/s: 3.6.3, 3.7.2
Component/s: Replication
Labels:
- former-quick-wins

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:

v4.4, v4.2, v4.0, v3.6, v3.4
Steps To Reproduce:
/*
 * An arbiter that is stopped and restarted on a different port and rejoins the
 * replica set causes an invariant failure because it tries to do an initial sync. See HELP-6042.
 */
(function() {
    "use strict";

    var replTest = new ReplSetTest({name: 'test', nodes: 3});
    replTest.startSet();
    var nodes = replTest.nodeList();
    var config = {
        "_id": "test",
        "members": [
            {"_id": 0, "host": nodes[0]},
            {"_id": 1, "host": nodes[1]},
            {"_id": 2, "host": nodes[2], arbiterOnly: true}
        ]
    };
    replTest.initiate(config);

    let primary = replTest.getPrimary();
    replTest.awaitReplication();
    replTest.awaitSecondaryNodes();

    var arbiterId = 2;
    var newPort = 45678;

    jsTestLog("Restarting the arbiter node on a new port: " + newPort);
    replTest.stop(arbiterId);
    replTest.start(arbiterId, {port: newPort}, true);

    jsTestLog("Reconfiguring the set to include the arbiter on the new port.");
    var config = primary.getDB("local").system.replset.findOne();

    jsTestLog("Original config:");
    jsTestLog(tojson(config));

    var hostname = config.members[arbiterId].host.split(":")[0];
    config.version++;
    config.members[arbiterId].host = hostname + ":" + newPort;

    jsTestLog("New config:");
    jsTestLog(tojson(config));

    assert.commandWorked(primary.getDB("admin").runCommand({replSetReconfig: config}));
    replTest.awaitReplication();
    replTest.awaitNodesAgreeOnConfigVersion();

    replTest.stopSet();
}());
Sprint:
Repl 2020-08-10, Repl 2020-09-07, Repl 2020-11-02, Repl 2020-11-16
Case:
Linked BF Score:
50
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Consider the following scenario.

Start a three-node replica set in which one node is an arbiter. Assume the hostnames for the nodes are:
- localhost:10000 (primary)
- localhost:10001 (secondary)
- localhost:10002 (arbiter)
Shut down the arbiter node, and restart it as part of the same replica set, but on a different port, say 20000. Its hostname is now localhost:20000.
When the arbiter starts up, it will try to load its previously persisted replica set config, which still lists the original hostnames above. In ReplicationCoordinatorImpl::_finishLoadLocalConfig it will call validateConfigForStartUp and try to find itself in the config by calling findSelfInConfig in repl_set_config_checks.cpp.
Since its hostname is now different from the one in the original config, it will fail to find itself, and so in _finishLoadLocalConfig we will report its index as -1.
We will then check whether this node is an arbiter in order to avoid starting data replication. However, if we don't find ourselves in the config, we never consider the node an arbiter, so we will go on to try to start data replication.
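
The decision described above can be modeled in a few lines of plain JavaScript. This is only an illustrative sketch, not the server's C++ code: the function names findSelfIndex and startsDataReplicationOnStartup and the exact-string host comparison are assumptions made for the example.

```javascript
// Hypothetical model of the startup check; names and host matching are
// simplifications for illustration, not the actual server implementation.
function findSelfIndex(config, myHost) {
    // Returns -1 when no member's host matches this node's current host.
    return config.members.findIndex((member) => member.host === myHost);
}

function startsDataReplicationOnStartup(config, myHost) {
    const myIndex = findSelfIndex(config, myHost);
    // A node that cannot find itself is never treated as an arbiter,
    // so the "skip data replication" branch is not taken for it.
    const isArbiter = myIndex !== -1 && config.members[myIndex].arbiterOnly === true;
    return !isArbiter;
}

// The config persisted before the arbiter was moved to port 20000.
const persistedConfig = {
    _id: "test",
    members: [
        {_id: 0, host: "localhost:10000"},
        {_id: 1, host: "localhost:10001"},
        {_id: 2, host: "localhost:10002", arbiterOnly: true},
    ],
};

// Arbiter on its original port: found in the config, replication is skipped.
console.log(startsDataReplicationOnStartup(persistedConfig, "localhost:10002")); // false
// Arbiter on the new port: index is -1, so replication is (wrongly) started.
console.log(findSelfIndex(persistedConfig, "localhost:20000")); // -1
console.log(startsDataReplicationOnStartup(persistedConfig, "localhost:20000")); // true
```

The last call shows the problematic path: the node is an arbiter in the persisted config, but because it cannot match itself by hostname, the model decides to start data replication.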

The fundamental issue is that we should not be starting data replication if we are an arbiter. See the attached repro script help_6042.js for an example of how this can manifest as an invariant failure. Basically, we are able to attempt an initial sync as an arbiter, and when that initial sync finishes we crash because we are not in the expected STARTUP2 state.

To fix this, one approach may be to never start data replication if we can't find ourselves in the local replica set config. If we can't find ourselves in the config, we should enter the REMOVED state, and we shouldn't need to start replicating until we become a proper member of the replica set. We could perhaps rely on the ReplicationCoordinatorImpl::_heartbeatReconfigStore to make sure that we start data replication whenever we receive a heartbeat that brings us back as a valid node into the config. It already has a check for whether or not we should start data replication.
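
One way to picture the proposed approach is the sketch below, again in plain JavaScript rather than the server's C++. The names indexOfSelf, decideOnStartup, and decideOnHeartbeatReconfig, the returned object shapes, and the alreadyReplicating flag are all hypothetical and exist only for this example; the actual check in _heartbeatReconfigStore may look quite different.

```javascript
// Hypothetical sketch of the proposed behavior; all names are illustrative.
function indexOfSelf(config, myHost) {
    return config.members.findIndex((member) => member.host === myHost);
}

// On startup: a node that is not in its local config is treated as removed
// and does not start data replication.
function decideOnStartup(config, myHost) {
    const myIndex = indexOfSelf(config, myHost);
    if (myIndex === -1) {
        return {removed: true, startDataReplication: false};
    }
    const isArbiter = config.members[myIndex].arbiterOnly === true;
    return {removed: false, startDataReplication: !isArbiter};
}

// On a heartbeat-driven reconfig: start data replication only if the new
// config contains this node as a data-bearing member and it is not running yet.
function decideOnHeartbeatReconfig(newConfig, myHost, alreadyReplicating) {
    const myIndex = indexOfSelf(newConfig, myHost);
    if (myIndex === -1) {
        return {removed: true, startDataReplication: false};
    }
    const isArbiter = newConfig.members[myIndex].arbiterOnly === true;
    return {removed: false, startDataReplication: !isArbiter && !alreadyReplicating};
}

const oldConfig = {
    _id: "test",
    version: 1,
    members: [
        {_id: 0, host: "localhost:10000"},
        {_id: 1, host: "localhost:10001"},
        {_id: 2, host: "localhost:10002", arbiterOnly: true},
    ],
};
// Same set after the reconfig that points the arbiter at its new port.
const newConfig = {
    _id: "test",
    version: 2,
    members: [
        {_id: 0, host: "localhost:10000"},
        {_id: 1, host: "localhost:10001"},
        {_id: 2, host: "localhost:20000", arbiterOnly: true},
    ],
};

console.log(decideOnStartup(oldConfig, "localhost:20000"));
console.log(decideOnHeartbeatReconfig(newConfig, "localhost:20000", false));
```

Under these assumptions the restarted arbiter stays removed and idle at startup, and when the reconfig arrives via heartbeat it is recognized as an arbiter, so data replication is never started for it. A data-bearing node that was re-added would take the same path and start replicating at that point.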

help_6042.js
2 kB
Mar 08 2018 02:59:46 PM UTC

causes

SERVER-53345 Excuse arbiter_new_hostname.js from multiversion tests

Closed

related to

SERVER-52680 Removed node on startup stuck in STARTUP2 after being re-added into the replica set

Closed

SERVER-53026 Secondary cannot restart replication

Closed

Assignee: A. Jesse Jiryu Davis
Reporter: Will Schultz
Participants: A. Jesse Jiryu Davis, Ali Mir, Githook User, Judah Schvimer, Louisa Berger, Siyuan Zhou, Will Schultz
Votes: 0
Watchers: 18

Created: Mar 08 2018 03:07:39 PM UTC
Updated: Oct 29 2023 10:34:00 PM UTC
Resolved: Nov 02 2020 11:14:26 PM UTC