[SERVER-33747] Arbiter tries to start data replication if cannot find itself in config after restart Created: 08/Mar/18  Updated: 29/Oct/23  Resolved: 02/Nov/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.6.3, 3.7.2
Fix Version/s: 4.2.11, 4.0.22, 3.6.22, 4.4.3, 5.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: William Schultz (Inactive) Assignee: A. Jesse Jiryu Davis
Resolution: Fixed Votes: 0
Labels: former-quick-wins
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File help_6042.js    
Issue Links:
Backports
Depends
Problem/Incident
causes SERVER-53345 Excuse arbiter_new_hostname.js from m... Closed
Related
related to SERVER-52680 Removed node on startup stuck in STAR... Closed
related to SERVER-53026 Secondary cannot restart replication Closed
related to DOCS-11640 Remove no-journal tip for arbiters un... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4, v4.2, v4.0, v3.6, v3.4
Steps To Reproduce:

/*
 * An arbiter that is stopped and restarted on a different port and rejoins the
 * replica set causes an invariant failure because it tries to do an initial sync. See HELP-6042.
 */
(function() {
    "use strict";
    var replTest = new ReplSetTest({name: 'test', nodes: 3});
    replTest.startSet();
    var nodes = replTest.nodeList();
    var config = {
        "_id": "test",
        "members": [
            {"_id": 0, "host": nodes[0]},
            {"_id": 1, "host": nodes[1]},
            {"_id": 2, "host": nodes[2], arbiterOnly: true}
        ]
    };
    replTest.initiate(config);
 
    let primary = replTest.getPrimary();
    replTest.awaitReplication();
    replTest.awaitSecondaryNodes();
 
    var arbiterId = 2;
    var newPort = 45678;
 
    jsTestLog("Restarting the arbiter node on a new port: " + newPort);
    replTest.stop(arbiterId);
    replTest.start(arbiterId, {port: newPort}, true);
 
    jsTestLog("Reconfiguring the set to include the arbiter on the new port.");
    config = primary.getDB("local").system.replset.findOne();
 
    jsTestLog("Original config:");
    jsTestLog(tojson(config));
 
    var hostname = config.members[arbiterId].host.split(":")[0];
    config.version++;
    config.members[arbiterId].host = hostname + ":" + newPort;
 
    jsTestLog("New config:");
    jsTestLog(tojson(config));
 
    assert.commandWorked(primary.getDB("admin").runCommand({replSetReconfig: config}));
    replTest.awaitReplication();
    replTest.awaitNodesAgreeOnConfigVersion();
 
    replTest.stopSet();
}());

Sprint: Repl 2020-08-10, Repl 2020-09-07, Repl 2020-11-02, Repl 2020-11-16
Participants:
Case:
Linked BF Score: 50

 Description   

Consider the following scenario.

  1. Start a 3 node replica set, with one arbiter node. Assume the hostnames for the nodes are
    • localhost:10000 (primary)
    • localhost:10001 (secondary)
    • localhost:10002 (arbiter)
  2. Shut down the arbiter node and restart it as part of the same replica set, but on a different port, say 20000. Its hostname is now localhost:20000.
  3. When the arbiter starts up, it tries to load its previously persisted replica set config, with the original hostnames listed above. In ReplicationCoordinatorImpl::_finishLoadLocalConfig it calls validateConfigForStartUp and tries to find itself in the config by calling findSelfInConfig in repl_set_config_checks.cpp.
  4. Since its hostname is now different from the one in the original config, it fails to find itself, and so _finishLoadLocalConfig reports its index as -1.
  5. We then check whether this node is an arbiter in order to avoid starting data replication. However, if we don't find ourselves in the config, we never consider the node an arbiter, so we go on to start data replication anyway.

The fundamental issue is that we should not be starting data replication if we are an arbiter. See the attached repro script help_6042.js for an example of how this can manifest in an invariant failure. In short, the arbiter is allowed to attempt an initial sync, and when that initial sync finishes the node crashes because it is not in the expected STARTUP2 state.
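
For illustration, here is a minimal shell snippet (not part of the attached repro; "arbiter" is a hypothetical direct connection to the node after it was restarted on port 20000) showing why the node cannot find itself: the persisted config in local.system.replset still lists the old address, so no member entry matches the node's new hostname:port.

// Sketch only: arbiter is assumed to be a connection to the restarted node.
var persisted = arbiter.getDB("local").system.replset.findOne();
jsTestLog("Persisted member hosts: " + tojson(persisted.members.map(function(m) {
    return m.host;
})));
// e.g. ["localhost:10000", "localhost:10001", "localhost:10002"] -- none of
// these equals "localhost:20000", so findSelfInConfig reports index -1.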

To fix this, one approach may be to never start data replication if we can't find ourselves in the local replica set config. If we can't find ourselves in the config, we should enter the REMOVED state, and we shouldn't need to start replicating until we become a proper member of the replica set. We could perhaps rely on the ReplicationCoordinatorImpl::_heartbeatReconfigStore to make sure that we start data replication whenever we receive a heartbeat that brings us back as a valid node into the config. It already has a check for whether or not we should start data replication.
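
For illustration, a rough sketch of the expected post-fix behavior, reusing the ReplSetTest setup from the attached help_6042.js (replTest, arbiterId, and newPort as defined there): after restarting on the new port but before the reconfig, the node should sit in REMOVED rather than attempting an initial sync. The exact shape of the replSetGetStatus response for a removed node is an assumption and may vary by version.

// Sketch only, under the assumptions above. This mirrors the restart step in
// help_6042.js but then checks the node's state instead of reconfiguring.
var arbiterConn = replTest.start(arbiterId, {port: newPort}, true);
assert.soon(function() {
    var res = arbiterConn.adminCommand({replSetGetStatus: 1});
    // Assumption: a removed node either reports myState == 10 (REMOVED) or
    // rejects the command with InvalidReplicaSetConfig.
    return (res.ok === 1 && res.myState === 10) ||
           (res.ok === 0 && res.code === ErrorCodes.InvalidReplicaSetConfig);
}, "arbiter should report REMOVED after restarting on a new port");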



 Comments   
Comment by Githook User [ 20/Nov/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: SERVER-33747 Fix crash when arbiter restarts and enters REMOVED

(cherry picked from commit 72aacd4ffaf6500777a8a51f87b0797f8ea8ad0b)
Branch: v4.4
https://github.com/mongodb/mongo/commit/e3c4ee9c2549c9cfe7d63c528e03488d6d7d2385

Comment by Githook User [ 20/Nov/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: SERVER-33747 Fix crash when arbiter restarts and enters REMOVED

(cherry picked from commit 72aacd4ffaf6500777a8a51f87b0797f8ea8ad0b)
Branch: v3.6
https://github.com/mongodb/mongo/commit/9fcb1f6911dc7904163515f45b361a1333a55310

Comment by Githook User [ 09/Nov/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: SERVER-33747 Fix crash when arbiter restarts and enters REMOVED

(cherry picked from commit 72aacd4ffaf6500777a8a51f87b0797f8ea8ad0b)
Branch: v4.0
https://github.com/mongodb/mongo/commit/d287524d1a5a0da345d2cb9fbfd8107b55f09002

Comment by Githook User [ 09/Nov/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: SERVER-33747 Fix crash when arbiter restarts and enters REMOVED

(cherry picked from commit 72aacd4ffaf6500777a8a51f87b0797f8ea8ad0b)
Branch: v4.2
https://github.com/mongodb/mongo/commit/3ed7e17fb826c2069b5572486a9c09d41997dc8a

Comment by Githook User [ 02/Nov/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: SERVER-33747 Fix crash when arbiter restarts and enters REMOVED
Branch: master
https://github.com/mongodb/mongo/commit/72aacd4ffaf6500777a8a51f87b0797f8ea8ad0b

Comment by Siyuan Zhou [ 09/Sep/20 ]

judah.schvimer, you are right that this ticket describes a solution in which we shouldn't start data replication in the REMOVED state. However, that diverges from the current convention that data replication is always running in the REMOVED state. As a result, huayu.ouyang found that with her patch, data replication does not start when a node starts up after restoring from a backup whose config does not contain the node, meaning the node is REMOVED. The two solutions mentioned above are for this new problem.

  1. Data replication is always disabled in the REMOVED state. As Huayu mentioned, stopping and restarting data replication is complex and out of scope.
  2. The other solution is to focus on the transition out of the REMOVED state. A node only starts data replication if it hasn't already been started, so there is no 1-to-1 mapping between the REMOVED state and the data replication state. Since we assume an arbiter cannot become a secondary, this approach should work.
  3. Another idea is to check whether the node is an arbiter after data replication starts. I don't like that solution since it violates the layering.

I feel this ticket needs more investigation, so I'd suggest holding off on it. CC tess.avitabile. Huayu has other project tickets in this iteration.

Comment by Ali Mir [ 10/Aug/20 ]

In my investigation, I've verified that this bug exists on both master and 4.4. The arbiter will not be able to find itself in the config (after calling validateConfigForStartup and then findSelfInConfig()), and will put itself into the REMOVED state. Afterwards, it will attempt to start data replication, and an invariant will fail. Will's solution sounds reasonable to me. If the node is put into the REMOVED state and never starts data replication, it will remain REMOVED until the primary is issued a reconfig to re-add the node to the config, after which it can identify itself as an arbiter.
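
To illustrate that expected end state, a minimal continuation of the repro (same variables as help_6042.js; the specific ReplSetTest helpers used here are assumptions about the test framework): once the primary's reconfig re-adds the node under its new host:port, the node should leave REMOVED and report itself as an arbiter.

// Sketch only: config has already been updated with the arbiter's new host:port,
// as in help_6042.js.
assert.commandWorked(primary.getDB("admin").runCommand({replSetReconfig: config}));
replTest.waitForState(replTest.nodes[arbiterId], ReplSetTest.State.ARBITER);
assert.eq(true, replTest.nodes[arbiterId].adminCommand({isMaster: 1}).arbiterOnly);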

Comment by Siyuan Zhou [ 21/Jul/20 ]

ali.mir, could you please investigate this issue and see if it could happen on master and 4.4? The goal is to understand the problem and the solution proposal and try to reproduce it on master and 4.4.

Comment by Judah Schvimer [ 06/Jan/20 ]

This is marked as depends on "Safe Reconfig". We will reconsider this after that project completes.

Comment by Louisa Berger [ 19/Mar/18 ]

Not sure if this makes a difference in your backporting / etc, but I also just saw this on 3.4.13 on Debian.

Mongo logs:
connections now open to ip-10-122-27-34.ec2.internal:9008 with a 5 second timeout)
2018-03-16T16:17:47.159+0000 I NETWORK  [ReplicaSetMonitor-TaskExecutor-0] Successfully connected to ip-10-122-27-34.ec2.internal:9007 (1 connections now open to ip-10-122-27-34.ec2.internal:9007 with a 5 second timeout)
2018-03-16T16:17:47.159+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Connecting to ip-10-122-27-34.ec2.internal:9008
2018-03-16T16:17:47.159+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Connecting to ip-10-122-27-34.ec2.internal:9007
2018-03-16T16:17:47.159+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Connecting to ip-10-122-27-34.ec2.internal:9007
2018-03-16T16:17:47.159+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to ip-10-122-27-34.ec2.internal:9008, took 0ms (1 connections now open to ip-10-122-27-34.ec2.internal:9008)
2018-03-16T16:17:47.160+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to ip-10-122-27-34.ec2.internal:9007, took 1ms (2 connections now open to ip-10-122-27-34.ec2.internal:9007)
2018-03-16T16:17:47.160+0000 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to ip-10-122-27-34.ec2.internal:9007, took 1ms (2 connections now open to ip-10-122-27-34.ec2.internal:9007)
2018-03-16T16:17:47.160+0000 I NETWORK  [shard registry reload] Starting new replica set monitor for a/ip-10-122-27-34.ec2.internal:9001,ip-10-122-27-34.ec2.internal:9002
2018-03-16T16:17:47.161+0000 I NETWORK  [ReplicaSetMonitor-TaskExecutor-0] Successfully connected to ip-10-122-27-34.ec2.internal:9001 (1 connections now open to ip-10-122-27-34.ec2.internal:9001 with a 5 second timeout)
2018-03-16T16:17:47.162+0000 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: LockStateChangeFailed: findAndModify query predicate didn't match any lock document
2018-03-16T16:17:47.162+0000 I NETWORK  [ReplicaSetMonitor-TaskExecutor-0] Successfully connected to ip-10-122-27-34.ec2.internal:9002 (1 connections now open to ip-10-122-27-34.ec2.internal:9002 with a 5 second timeout)
2018-03-16T16:17:49.161+0000 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/data/mci/3ba67845e7d056c1671532a0bcfad351/mms-automation/go_planner/tmp/output/data/process9003/diagnostic.data'
2018-03-16T16:17:49.161+0000 I REPL     [initandlisten] Replaying stored operations from { ts: Timestamp 1521217052000|1, t: 1 } (exclusive) to { ts: Timestamp 1521217062000|2, t: 1 } (inclusive).
2018-03-16T16:17:49.161+0000 F REPL     [initandlisten] Oplog entry at { ts: Timestamp 1521217052000|1, t: 1 } is missing; actual entry found is { ts: Timestamp 1521217062000|2, t: 1 }
2018-03-16T16:17:49.161+0000 I -        [initandlisten] Fatal Assertion 40292 at src/mongo/db/repl/replication_coordinator_external_state_impl.cpp 659
2018-03-16T16:17:49.161+0000 I -        [initandlisten] 
 
***aborting after fassert() failure
