[SERVER-46897] REMOVED node may never send heartbeat to fetch newest config Created: 16/Mar/20  Updated: 29/Oct/23  Resolved: 24/Mar/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.2.4, 4.4.0-rc0
Fix Version/s: 4.4.0-rc0, 4.0.20, 4.2.8, 4.7.0

Type: Bug
Priority: Major - P3
Reporter: William Schultz (Inactive)
Assignee: William Schultz (Inactive)
Resolution: Fixed
Votes: 0
Labels: safe-reconfig-related
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
related to SERVER-45575 Add Javascript helpers for doing non ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4, v4.2, v4.0, v3.6
Steps To Reproduce:

var replTest = new ReplSetTest({nodes: 2});
const nodes = replTest.startSet();
replTest.initiateWithHighElectionTimeout();
 
var primary = replTest.getPrimary();
var secondary = replTest.getSecondary();
let config = replTest.getReplSetConfigFromNode();
let origConfig = Object.assign({}, config);
 
jsTestLog("Starting reconfigs.");
 
// Reconfig from {n0,n1} -> {n0}.
// n1 will now be REMOVED.
config.version++;
config.members = [origConfig.members[0]];
jsTestLog("Reconfiguring to members: " + tojsononeline(config.members.map(m => m._id)) +
          " with version: " + config.version);
assert.commandWorked(primary.adminCommand({replSetReconfig: config, maxTimeMS: 5000}));
 
// Wait for the config to propagate to n1 so it enters REMOVED.
sleep(4000);
 
// No-op reconfig from {n0} -> {n0}.
// n1 is still REMOVED.
config.version++;
jsTestLog("Reconfiguring to members: " + tojsononeline(config.members.map(m => m._id)) +
          " with version: " + config.version);
assert.commandWorked(primary.adminCommand({replSetReconfig: config, maxTimeMS: 5000}));
 
// Reconfig from {n0} -> {n0, n1}.
// n1 was previously REMOVED, but will now be added back in. It should be able to get the new config
// eventually.
config.version++;
config.members = [origConfig.members[0], origConfig.members[1]];
jsTestLog("Reconfiguring to members: " + tojsononeline(config.members.map(m => m._id)) +
          " with version: " + config.version);
assert.commandWorked(primary.adminCommand({replSetReconfig: config, maxTimeMS: 5000}));
 
replTest.awaitNodesAgreeOnConfigVersion();
replTest.stopSet();

Sprint: Repl 2020-04-06
Participants:
Linked BF Score: 36

 Description   

When a replica set node installs a config that it is not a member of, it enters the REMOVED state. While in the REMOVED state, it keeps a seed list of every other node that sends it a heartbeat request. If a node n1 is currently REMOVED and receives a heartbeat from node n0, it adds n0 to its seed list.

If n1 then learns of a newer config that it is still not a member of, it installs this config and cancels its outgoing heartbeats. It does not reschedule any heartbeats, though, since it is still REMOVED in its current config. It also does not clear its seed list, since that only happens when heartbeats are restarted. At this point n1 is REMOVED, its seed list contains n0, and it is not heartbeating any other node.

If n0 then executes a reconfig that adds n1 back into the set, n1 will never learn of it, because n1 only schedules a heartbeat to fetch a config when its seed list changes, and further heartbeats from n0 no longer change it. n1 will remain on a stale config indefinitely. To fix this issue, we may want to clear the seed list any time a node installs a new config that it is not a member of.
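The sequence above can be sketched as a small toy model in plain JavaScript. This is not the server's implementation: the class `RemovedNodeModel`, its methods, and the `clearSeedListOnRemovedInstall` flag are hypothetical names introduced only to illustrate the seed list behavior described here, under the assumption that a REMOVED node fetches a config only when an incoming heartbeat changes its seed list.

```javascript
// Toy model only. All names here are hypothetical and do not exist in the server.
class RemovedNodeModel {
    constructor(name, clearSeedListOnRemovedInstall) {
        this.name = name;
        // When true, models the proposed fix: clear the seed list whenever
        // the node installs a config that it is not a member of.
        this.clearSeedListOnRemovedInstall = clearSeedListOnRemovedInstall;
        this.configVersion = 0;
        this.isMember = false;
        this.seedList = new Set();
    }

    // Install a config. `members` is an array of node names.
    installConfig(version, members) {
        this.configVersion = version;
        this.isMember = members.includes(this.name);
        if (!this.isMember && this.clearSeedListOnRemovedInstall) {
            this.seedList.clear();
        }
    }

    // Handle an incoming heartbeat request while REMOVED.
    // Returns true if the node would schedule a heartbeat to fetch a config,
    // which in this model happens only when the seed list changes.
    receiveHeartbeat(sender) {
        if (this.isMember || this.seedList.has(sender)) {
            return false;
        }
        this.seedList.add(sender);
        return true;
    }
}

// Replays the scenario from n1's point of view and returns n1's final config version.
function runScenario(applyFix) {
    const n1 = new RemovedNodeModel("n1", applyFix);
    n1.installConfig(3, ["n0"]);  // n1 becomes REMOVED.
    if (n1.receiveHeartbeat("n0")) {
        n1.installConfig(4, ["n0"]);  // n1 fetches a newer config; still REMOVED.
    }
    // n0 has now added n1 back and heartbeats it again.
    if (n1.receiveHeartbeat("n0")) {
        n1.installConfig(5, ["n0", "n1"]);
    }
    return n1.configVersion;
}

console.log("without fix:", runScenario(false));  // 4: n1 is stuck on the stale config
console.log("with fix:", runScenario(true));      // 5: n1 fetches the config that re-adds it
```

Without the fix, the second heartbeat from n0 finds n0 already in the seed list, so the model never schedules a fetch. With the seed list cleared on the REMOVED install, the same heartbeat changes the seed list again and triggers the fetch.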



 Comments   
Comment by Githook User [ 29/Jun/20 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-46897 Clear replication node seed list whenever we install a new config that we are not a member of

(cherry picked from commit c2b282a1ba5f4f87d59456912073398a504281dc)
Branch: v4.0
https://github.com/mongodb/mongo/commit/9e41b51edf59d75b745d0dd304a29287e7e87bf5

Comment by Githook User [ 27/May/20 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-46897 Clear replication node seed list whenever we install a new config that we are not a member of

(cherry picked from commit c2b282a1ba5f4f87d59456912073398a504281dc)
Branch: v4.2
https://github.com/mongodb/mongo/commit/85f1ed156770a7821d39ff1704327d4b9a7c41ab

Comment by Githook User [ 26/Mar/20 ]

Author:

{'name': 'William Schultz', 'username': 'will62794', 'email': 'william.schultz@mongodb.com'}

Message: SERVER-46897 Clear replication node seed list whenever we install a new config that we are not a member of

(cherry picked from commit c2b282a1ba5f4f87d59456912073398a504281dc)
Branch: v4.4
https://github.com/mongodb/mongo/commit/fd7e392d1fe417bfa94579d08840704271435e5b

Comment by William Schultz (Inactive) [ 25/Mar/20 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-46897 Clear replication node seed list whenever we install a new config that we are not a member of
Branch: master
https://github.com/mongodb/mongo/commit/c2b282a1ba5f4f87d59456912073398a504281dc

Comment by William Schultz (Inactive) [ 16/Mar/20 ]

Note that a REMOVED node normally shouldn't learn of a newer config in which it is still REMOVED, since an upstream node that installs a newer config won't be heartbeating a REMOVED node. However, when a node executes a reconfig from Ci to Cj, where some node is REMOVED in Ci but not in Cj, the reconfiguring node runs a quorum check that sends a heartbeat to the REMOVED node. The REMOVED node may then schedule a heartbeat to fetch Ci from the upstream node and install it. This is one case where a node that is already REMOVED fetches and installs a newer config in which it is still REMOVED, and it is what happens in the attached repro.

More specifically, if we move through the following configs:

C2      -> C3   -> C4   -> C5
{n0,n1} -> {n0} -> {n0} -> {n0, n1}

then n1 may still be on C3 when n0 executes the quorum check for the reconfig from C4 to C5. The heartbeat sent from n0 to n1 during this quorum check will prompt n1 to fetch C4 and install it, even though it's still REMOVED in C4.
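As a rough illustration of which reconfig in this sequence triggers the problematic quorum check, the sketch below scans a list of configs for transitions Ci -> Cj that add a node absent from Ci. The helper `readdedNodes` and the config objects are hypothetical and simplified (members are plain names); the sketch also assumes that any node absent from Ci but present in Cj is contacted by the quorum check.

```javascript
// Illustrative only: `readdedNodes` is a hypothetical helper, not a server or shell API.
const configs = [
    {version: 2, members: ["n0", "n1"]},
    {version: 3, members: ["n0"]},
    {version: 4, members: ["n0"]},
    {version: 5, members: ["n0", "n1"]},
];

// For each consecutive pair Ci -> Cj, report the nodes that are not in Ci but are in Cj.
function readdedNodes(configs) {
    const result = [];
    for (let i = 1; i < configs.length; i++) {
        const prevMembers = new Set(configs[i - 1].members);
        const added = configs[i].members.filter(m => !prevMembers.has(m));
        if (added.length > 0) {
            result.push({from: configs[i - 1].version, to: configs[i].version, nodes: added});
        }
    }
    return result;
}

console.log(JSON.stringify(readdedNodes(configs)));
// [{"from":4,"to":5,"nodes":["n1"]}]
```

Only the C4 -> C5 transition adds a node back, which matches the point at which n0's quorum check heartbeat reaches n1 in the repro.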

Comment by William Schultz (Inactive) [ 16/Mar/20 ]

I tested this on 4.2 and it appears to also be an issue there. I have not tested on any earlier version but I expect this issue goes back at least a few versions.
