- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 3.6.3, 3.7.2
- Component/s: Replication
- Backwards Compatibility: Fully Compatible
- Operating System: ALL
- Backport Requested: v4.4, v4.2, v4.0, v3.6, v3.4
- Sprint: Repl 2020-08-10, Repl 2020-09-07, Repl 2020-11-02, Repl 2020-11-16
Consider the following scenario.
- Start a 3-node replica set with one arbiter node. Assume the hostnames for the nodes are:
- localhost:10000 (primary)
- localhost:10001 (secondary)
- localhost:10002 (arbiter)
- Shut down the arbiter node and restart it as part of the same replica set, but on a different port, say 20000. Its hostname is now localhost:20000.
- When the arbiter starts up, it will try to load its previously persisted replica set config, which still lists the original hostnames above. In ReplicationCoordinatorImpl::_finishLoadLocalConfig it will call validateConfigForStartUp and try to find itself in the config by calling findSelfInConfig in repl_set_config_checks.cpp.
- Since its hostname is now different from the one in the original config, it will fail to find itself, so in _finishLoadLocalConfig we report its index as -1.
- We then check whether this node is an arbiter in order to avoid starting data replication. However, if we don't find ourselves in the config, we never consider the node an arbiter, so we proceed to start data replication anyway (the persisted config this lookup runs against is shown in the snippet after this list).
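To make the mismatch concrete, the persisted config can be inspected directly on the restarted arbiter. The snippet below is a minimal illustration using the hostnames from the scenario above; the expected output is illustrative, not taken from the attached repro.

```javascript
// Run in a mongo shell connected to the restarted arbiter (localhost:20000).
// The persisted replica set config still lists the original member hostnames,
// so the node cannot find itself in it and reports its member index as -1.
const localDb = db.getSiblingDB("local");
const cfg = localDb.system.replset.findOne();
printjson(cfg.members.map(m => m.host));
// Expected in this scenario:
// [ "localhost:10000", "localhost:10001", "localhost:10002" ]
```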
The fundamental issue is that we should not start data replication if we are an arbiter. See the attached repro script help_6042.js for an example of how this can manifest as an invariant failure: the arbiter is able to attempt an initial sync, and when initial sync finishes we crash because the node is not in the expected STARTUP2 state.
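For reference, a minimal sketch of such a repro using the standard jstests helpers (ReplSetTest, MongoRunner, allocatePort) is below. This is not the attached help_6042.js; the node options and the dbpath handling are assumptions for illustration.

```javascript
// Sketch: restart an arbiter on a different port with its original data files.
const rst = new ReplSetTest({nodes: [{}, {}, {arbiter: true}]});
rst.startSet();
rst.initiate();

// Shut down the arbiter (member index 2) but keep its data directory.
const arbiter = rst.nodes[2];
const dbpath = arbiter.dbpath;  // assumed to be populated by the test framework
rst.stop(2);

// Restart it on a new port, reusing the same dbpath, so the persisted config no
// longer contains the node's own host:port. Before the fix, the node would fail
// to recognize itself as an arbiter, attempt initial sync, and eventually hit
// the invariant because it is not in STARTUP2.
const restarted = MongoRunner.runMongod({
    port: allocatePort(),
    dbpath: dbpath,
    noCleanData: true,
    replSet: rst.name,
});
```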
To fix this, one approach may be to never start data replication if we can't find ourselves in the local replica set config. In that case we should enter the REMOVED state, and we shouldn't need to start replicating until we become a proper member of the replica set again. We could perhaps rely on ReplicationCoordinatorImpl::_heartbeatReconfigStore to make sure we start data replication whenever we receive a heartbeat that brings us back into the config as a valid node; it already has a check for whether we should start data replication.
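With that approach, the restarted arbiter would sit in REMOVED until the replica set config is updated to include its new hostname, at which point the heartbeat that delivers the new config could trigger the start-data-replication check. A shell sketch of re-adding the arbiter under its new address, using the hostnames from the scenario above:

```javascript
// Run against the current primary (localhost:10000 in this scenario).
// Update the arbiter's host to its new address; the next heartbeat to
// localhost:20000 then carries a config that contains the node again.
const cfg = rs.conf();
cfg.members[2].host = "localhost:20000";  // index 2 is the arbiter in this scenario
rs.reconfig(cfg);
```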
- causes: SERVER-53345 Excuse arbiter_new_hostname.js from multiversion tests (Closed)
- related to: SERVER-52680 Removed node on startup stuck in STARTUP2 after being re-added into the replica set (Closed)
- related to: SERVER-53026 Secondary cannot restart replication (Closed)