Nodes Leak Heartbeat Threads on an Asymmetrical Network Partition


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: 4.4.9, 7.0.11, 8.0.19
    • Component/s: None
    • Server Triage
    • ALL

      Create the setup described in this ticket by applying the iptables rule below to one node of the replica set, in order to asymmetrically partition it from two other nodes:

      sudo iptables -A OUTPUT -d <another_secondary_node_ip> -p tcp --dport <mongo_port> -j DROP 
    • Server Triage 2026-04-27, Server Triage 2026-05-18

      Summary:

      We observed an edge case in mongo's heartbeat protocol: when secondaries in a replica set sit behind an asymmetrical network partition, meaning node A can reach node B but node B cannot reach node A, heartbeat threads leak on the affected nodes.

      The leak adds one thread per heartbeat interval, and over time these threads start contending on the shared _mutex in ReplicationCoordinator. Since that mutex is also taken by function calls the Oplog Fetcher needs, the secondaries impacted by this network partition start lagging.
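For a sense of scale, a rough growth estimate (assuming the default heartbeatIntervalMillis of 2000 ms and, as described above, exactly one stranded thread per interval per unreachable peer):

```python
# Rough estimate of leaked heartbeat threads over time.
# Assumptions (not measured from mongod): default heartbeatIntervalMillis of
# 2000 ms, and one stranded thread per interval per unreachable peer.
def leaked_threads(elapsed_seconds, heartbeat_interval_ms=2000, unreachable_peers=1):
    intervals = elapsed_seconds * 1000 // heartbeat_interval_ms
    return intervals * unreachable_peers

# After one hour with two unreachable peers (Node 3 and Node 4):
print(leaked_threads(3600, unreachable_peers=2))  # 3600
```

So even a single unreachable peer strands on the order of 1800 threads per hour under these assumptions, which matches the contention build-up we observed.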

       

      Setup:

      Assume we have a replica set with a 5-node configuration, where Node 1 is the primary and the rest are secondaries. Assume that due to some network configuration issue (easy to simulate via iptables rules):

      • Node 2 cannot reach or establish TCP connections to Node 3 and Node 4
      • Node 3 and Node 4, however, can establish TCP connections to Node 2
      • Apart from that limitation, every node can reach every other node
      • Assume all nodes have recently been restarted

      Events Leading to the Replication Lag:

      With the above setup, the following events occur:

      1. Every heartbeatIntervalMillis, Node 3 and Node 4 send a heartbeat to Node 2.
      2. Node 2 receives it and, while processing it, starts a heartbeat thread to fetch the latest config from the sender (the versions do not match because Node 2 was recently restarted and holds config version -1 for Node 3 and Node 4 in its local copy).
      3. The heartbeat started above times out due to the network partition, so the config version remains unchanged, and the thread keeps retrying the heartbeat call.
      4. Steps 1-3 repeat every heartbeatIntervalMillis, each time leaving one extra thread attempting heartbeats, because the config version mismatch always causes the config-changed check to evaluate to true.
      5. Over time these threads build up and start contending on the shared _mutex in ReplicationCoordinator, often just scanning large lists to track and untrack handles.
      6. Eventually the Oplog Fetcher threads also suffer from this lock contention, and we start seeing replication lag.
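The pattern in steps 1-6 can be reproduced in miniature outside mongod. The sketch below is not MongoDB code; it only mimics the shape of the leak: every "interval" spawns a new worker because the previous one is still stuck retrying, and all workers hammer one shared lock, standing in for ReplicationCoordinator's _mutex.

```python
# Minimal simulation of the leak pattern (illustrative only, not mongod code).
import threading
import time

shared_mutex = threading.Lock()   # stands in for ReplicationCoordinator::_mutex
stop = threading.Event()
workers = []                      # the ever-growing set of leaked threads

def stuck_heartbeat_worker():
    # Retries forever, taking the shared lock on each attempt, like the
    # config-fetch heartbeat that keeps timing out across the partition.
    while not stop.is_set():
        with shared_mutex:
            pass  # e.g. scanning lists to track/untrack handles
        time.sleep(0.01)

# One new worker per "heartbeat interval"; none of the old ones ever exit.
for _ in range(20):
    t = threading.Thread(target=stuck_heartbeat_worker, daemon=True)
    t.start()
    workers.append(t)

time.sleep(0.1)
print(len(workers))  # 20 workers still alive after 20 intervals
stop.set()
```

With realistic numbers (one interval every 2 seconds, running for hours), the worker count grows into the thousands and the lock becomes the bottleneck, which is what step 6 describes.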

      Questions:

      • Is there a reason why we cannot have a single heartbeat thread per target by adding a de-dupe check there?
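To make the suggestion concrete, here is a hypothetical sketch of such a de-dupe check: at most one in-flight config-fetch heartbeat per target node. All names (HeartbeatScheduler, try_start_heartbeat, finish_heartbeat) are illustrative, not MongoDB's actual API.

```python
# Hypothetical per-target de-dupe guard (illustrative names, not mongod's API).
import threading

class HeartbeatScheduler:
    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = set()  # targets with an active heartbeat thread

    def try_start_heartbeat(self, target):
        """Return True and mark the target only if no heartbeat is in flight."""
        with self._lock:
            if target in self._in_flight:
                return False  # de-dupe: a thread is already working this target
            self._in_flight.add(target)
            return True

    def finish_heartbeat(self, target):
        # Called when the heartbeat thread exits (success, timeout, or error).
        with self._lock:
            self._in_flight.discard(target)

sched = HeartbeatScheduler()
print(sched.try_start_heartbeat("node3:27017"))  # True  (first thread starts)
print(sched.try_start_heartbeat("node3:27017"))  # False (deduped, no new thread)
sched.finish_heartbeat("node3:27017")
print(sched.try_start_heartbeat("node3:27017"))  # True  (slot freed, next retry allowed)
```

Under this scheme a partitioned target would pin exactly one retrying thread instead of one per interval, which would cap the contention on the shared mutex.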

            Assignee:
            Chris Kelly
            Reporter:
            Deep Vyas
            Votes:
            0
            Watchers:
            5
