[SERVER-41031] After an unreachable node is added to and then removed from the replica set, the other replica set members continue to send heartbeats to the removed node Created: 07/May/19  Updated: 06/Dec/22

Status: Open
Project: Core Server
Component/s: Replication
Affects Version/s: 4.0.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Linda Qin Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-36417 Drop pooled connections to nodes no l... Blocked
Related
related to SERVER-48975 Increase isSelf logging verbosity Closed
is related to SERVER-43632 Possible memory leak in 4.0 Closed
is related to SERVER-35649 Nodes removed due to isSelf failure s... Closed
Assigned Teams:
Replication
Operating System: ALL
Steps To Reproduce:
1. Start a replica set on 4.0.9.
2. Connect to the primary and run rs.add("NEW_HOST:2017"), where NEW_HOST is a server that the replica set members cannot connect to.
3. Heartbeats to this node fail. This is expected.
4. Run rs.remove("NEW_HOST:2017") to remove the node from the replica set.
5. After the node is removed from the replica set, the replica set members still send heartbeats to it, which is unexpected.
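
Steps 2 and 4 can be sketched as a small mongo shell script. This is only an illustration, assuming a shell connected to the primary. The helper names memberHosts and addThenRemove are hypothetical and not part of the shell; only rs.add(), rs.remove() and rs.conf() are real shell helpers. The script confirms that the host has left the replica set config. It does not detect the heartbeats themselves, which have to be observed separately (this ticket does not say how).

```javascript
// Placeholder from the steps above: a host the replica set members cannot reach.
var UNREACHABLE = "NEW_HOST:2017";

// Hypothetical helper: list the host strings in a replica set config document.
function memberHosts(config) {
  return config.members.map(function (m) { return m.host; });
}

// Hypothetical helper: run steps 2 and 4, checking the config after each one.
// rsHelper is the shell's global rs object; it is passed in as a parameter
// so the function can also be exercised with a stub outside the shell.
function addThenRemove(rsHelper, host) {
  rsHelper.add(host);                                   // step 2
  var afterAdd = memberHosts(rsHelper.conf()).indexOf(host) !== -1;
  rsHelper.remove(host);                                // step 4
  var afterRemove = memberHosts(rsHelper.conf()).indexOf(host) !== -1;
  return { inConfigAfterAdd: afterAdd, inConfigAfterRemove: afterRemove };
}
```

In the shell this would be called as addThenRemove(rs, UNREACHABLE). If inConfigAfterRemove is false, the config no longer lists the node, so any heartbeats still sent to it after that point are the behaviour reported in step 5.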

Sprint: Repl 2019-07-01, Repl 2019-07-15, Repl 2019-07-29, Repl 2019-08-12, Repl 2019-09-23, Repl 2019-10-07, Repl 2019-10-21, Repl 2019-11-04
Participants:
Case:
Linked BF Score: 3

 Description   
  • This issue is reproducible on 4.0 but not on 3.6.
  • Stepping down the primary doesn't seem to stop those heartbeats to the removed node.
  • In 4.0, if NEW_HOST is reachable from the replica set members (whether or not a mongod process is running on NEW_HOST), the issue is not reproducible.
     
    Reproduced on RHEL7. I didn't try other platforms, so I'm not sure whether this is platform-specific.


 Comments   
Comment by Judah Schvimer [ 15/Aug/19 ]

We'll check which 4.1.x releases the test is relevant for, fix it on 4.0, and keep the test on master so we don't break it again.
