- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: 4.4.9, 7.0.11, 8.0.19
- Component/s: None
Summary:
We observed an edge case in the MongoDB heartbeat protocol: when secondaries in a replica set sit behind an asymmetrical network partition (node A can reach node B, but node B cannot reach node A), heartbeat threads leak.
The leak grows by one thread per heartbeat interval, and over time these threads start contending on the shared _mutex in ReplicationCoordinator. Since the Oplog Fetcher also needs calls that take this mutex, the secondaries affected by the partition eventually start lagging.
Setup:
Assume a replica set with a 5-node configuration, where Node 1 is the primary and the rest are secondaries. Due to a networking misconfiguration (easy to simulate via iptables rules; a sketch follows this list):
- Node 2 cannot reach or establish TCP connections to Node 3 and Node 4
- Node 3 and Node 4, however, can establish TCP connections to Node 2
- Apart from the limitation above, every node can reach every other node
- Assume all nodes have recently been restarted
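A minimal sketch of one way to simulate this asymmetry with iptables, assuming hypothetical addresses 10.0.0.3 and 10.0.0.4 for Node 3 and Node 4 and the default mongod port 27017; run it on Node 2 with root privileges. Dropping only NEW outbound connections keeps connections initiated by Node 3 and Node 4 toward Node 2 intact, which reproduces the one-way reachability described above.

```python
#!/usr/bin/env python3
"""Simulate the asymmetrical partition on Node 2 (hypothetical addresses)."""
import subprocess

# Hypothetical addresses and port; adjust for the actual test cluster.
BLOCKED_PEERS = ["10.0.0.3", "10.0.0.4"]   # Node 3 and Node 4
MONGOD_PORT = "27017"

def block_outbound(peer: str) -> None:
    """Drop NEW outbound TCP connections from this host (Node 2) to `peer`.

    Connections that `peer` initiates toward Node 2 are unaffected, which is
    what makes the partition asymmetrical.
    """
    subprocess.run(
        ["iptables", "-A", "OUTPUT",
         "-d", peer, "-p", "tcp", "--dport", MONGOD_PORT,
         "-m", "conntrack", "--ctstate", "NEW",
         "-j", "DROP"],
        check=True,
    )

if __name__ == "__main__":
    for peer in BLOCKED_PEERS:
        block_outbound(peer)
```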
Events Leading to the Replication Lag:
With the above setup, the following events occur:
- Every heartbeatIntervalMillis, Node 3 and Node 4 send a heartbeat to Node 2.
- Node 2 receives the heartbeat and, while processing it, starts a heartbeat thread to fetch the latest config from the sending node (the config version does not match because Node 2 was recently restarted and still has the config version for Node 3 and Node 4 recorded as -1 in its local copy).
- The heartbeat started above times out because of the network partition, so the config version remains unchanged, and the thread keeps retrying the heartbeat calls.
- This loop of the three steps above repeats every heartbeatIntervalMillis and starts one extra heartbeat-retrying thread each time, because the config version mismatch described above always causes the config-changed check to evaluate to true.
- Over time these threads build up and start contending on the shared _mutex in ReplicationCoordinator, often just scanning large lists to track and untrack handles.
- This contention on the lock eventually hits the Oplog Fetcher threads as well, and we start seeing replication lag; a toy model of this build-up follows this list.
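The following is a toy Python model of the build-up described above, not MongoDB code: every interval the heartbeat handler spawns a new thread to fetch the config from an unreachable target, no thread ever finishes because the fetch always "times out", and every thread periodically takes the same shared lock, mimicking the contention on the ReplicationCoordinator mutex. The interval, names, and lock are stand-ins, not the server's actual identifiers.

```python
#!/usr/bin/env python3
"""Toy model (not MongoDB code) of the heartbeat-thread build-up."""
import threading
import time

HEARTBEAT_INTERVAL = 0.5                  # stand-in for heartbeatIntervalMillis
shared_mutex = threading.Lock()           # stand-in for ReplicationCoordinator's _mutex

def fetch_config_forever(target: str) -> None:
    """Keeps retrying a config fetch that always times out (partitioned target)."""
    while True:
        time.sleep(HEARTBEAT_INTERVAL)    # the fetch "times out"
        with shared_mutex:                # track/untrack the handle under the shared lock
            pass                          # config version still mismatched -> retry

def on_heartbeat_received(target: str) -> None:
    """Models the handler: a mismatched config version spawns a new fetch thread
    every time, with no check for an already-running fetch for the same target."""
    threading.Thread(target=fetch_config_forever, args=(target,), daemon=True).start()

if __name__ == "__main__":
    while True:
        on_heartbeat_received("node-3")   # one new thread per interval, forever
        print("threads:", threading.active_count())
        time.sleep(HEARTBEAT_INTERVAL)
```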
Questions:
- Is there a reason we cannot have a single heartbeat thread per target by adding a de-dupe check here? (A toy sketch of this idea follows.)
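A hedged sketch of the de-dupe idea raised in the question, again as a toy Python model rather than the server's actual heartbeat code: remember which targets already have an in-flight config fetch and skip scheduling another one for the same target. The names (`schedule_config_fetch_once`, `_inflight_targets`) are hypothetical.

```python
import threading

_inflight_lock = threading.Lock()
_inflight_targets: set[str] = set()

def schedule_config_fetch_once(target: str, fetch) -> bool:
    """Start a config fetch for `target` only if one is not already running.

    Returns True if a new fetch thread was started, False if it was de-duped.
    """
    with _inflight_lock:
        if target in _inflight_targets:
            return False                  # de-dupe: a fetch is already in flight
        _inflight_targets.add(target)

    def run() -> None:
        try:
            fetch(target)                 # may retry/time out internally
        finally:
            with _inflight_lock:
                _inflight_targets.discard(target)

    threading.Thread(target=run, daemon=True).start()
    return True
```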