Nodes Leak Heartbeat Threads on an Asymmetrical Network Partition


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: 4.4.9, 7.0.11, 8.0.19
    • Component/s: None
    • Server Triage
    • ALL

      Create the setup described in this ticket by applying the iptables rule below to one node of the replica set, in order to asymmetrically partition it from two other nodes:

      sudo iptables -A OUTPUT -d <another_secondary_node_ip> -p tcp --dport <mongo_port> -j DROP 
    • Server Triage 2026-04-27, Server Triage 2026-05-18

      Summary:

      We observed an edge case in mongo's heartbeat protocol: when secondaries in a replica set sit behind an asymmetrical network partition, meaning node A can reach node B but node B cannot reach node A, heartbeat threads leak on the affected nodes.

      The leak adds one thread per heartbeat interval, and over time these threads start contending on the shared _mutex in ReplicationCoordinator. Since that mutex is also taken by function calls the Oplog Fetcher needs, the secondaries impacted by this network partition start lagging.
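For a sense of scale, a rough growth estimate (assuming the default heartbeatIntervalMillis of 2000 ms and, as described above, exactly one stranded thread per interval per unreachable peer):

```python
# Rough estimate of leaked heartbeat threads over time.
# Assumptions (not measured from mongod): default heartbeatIntervalMillis of
# 2000 ms, and one stranded thread per interval per unreachable peer.
def leaked_threads(elapsed_seconds, heartbeat_interval_ms=2000, unreachable_peers=1):
    intervals = elapsed_seconds * 1000 // heartbeat_interval_ms
    return intervals * unreachable_peers

# After one hour with two unreachable peers (Node 3 and Node 4):
print(leaked_threads(3600, unreachable_peers=2))  # 3600
```

So even a single unreachable peer strands on the order of 1800 threads per hour under these assumptions, which matches the contention build-up we observed.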

       

      Setup:

      Assume we have a replica set with a 5-node configuration, where Node 1 is the primary and the rest are secondaries. Assume that due to some network configuration issue (easy to simulate via iptables rules):

      • Node 2 cannot reach or establish TCP connections to Node 3 and Node 4
      • Node 3 and Node 4, however, can establish TCP connections to Node 2
      • Apart from that limitation, every node can reach every other node
      • Assume all nodes have recently been restarted

      Events Leading to the Replication Lag:

      With the above setup, the following events occur:

      1. Every heartbeatIntervalMillis, Node 3 and Node 4 send a heartbeat to Node 2.
      2. Node 2 receives it and, while processing it, starts a heartbeat thread to fetch the latest config from the sender (the versions do not match because Node 2 was recently restarted and holds config version -1 for Node 3 and Node 4 in its local copy).
      3. The heartbeat started above times out due to the network partition, so the config version remains unchanged, and the thread keeps retrying the heartbeat call.
      4. Steps 1-3 repeat every heartbeatIntervalMillis, each time leaving one extra thread attempting heartbeats, because the config version mismatch always causes the config-changed check to evaluate to true.
      5. Over time these threads build up and start contending on the shared _mutex in ReplicationCoordinator, often just scanning large lists to track and untrack handles.
      6. Eventually the Oplog Fetcher threads also suffer from this lock contention, and we start seeing replication lag.
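The pattern in steps 1-6 can be reproduced in miniature outside mongod. The sketch below is not MongoDB code; it only mimics the shape of the leak: every "interval" spawns a new worker because the previous one is still stuck retrying, and all workers hammer one shared lock, standing in for ReplicationCoordinator's _mutex.

```python
# Minimal simulation of the leak pattern (illustrative only, not mongod code).
import threading
import time

shared_mutex = threading.Lock()   # stands in for ReplicationCoordinator::_mutex
stop = threading.Event()
workers = []                      # the ever-growing set of leaked threads

def stuck_heartbeat_worker():
    # Retries forever, taking the shared lock on each attempt, like the
    # config-fetch heartbeat that keeps timing out across the partition.
    while not stop.is_set():
        with shared_mutex:
            pass  # e.g. scanning lists to track/untrack handles
        time.sleep(0.01)

# One new worker per "heartbeat interval"; none of the old ones ever exit.
for _ in range(20):
    t = threading.Thread(target=stuck_heartbeat_worker, daemon=True)
    t.start()
    workers.append(t)

time.sleep(0.1)
print(len(workers))  # 20 workers still alive after 20 intervals
stop.set()
```

With realistic numbers (one interval every 2 seconds, running for hours), the worker count grows into the thousands and the lock becomes the bottleneck, which is what step 6 describes.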

      Questions:

      • Is there a reason why we cannot have a single heartbeat thread per target by adding a de-dupe check there?
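To make the suggestion concrete, here is a hypothetical sketch of such a de-dupe check: at most one in-flight config-fetch heartbeat per target node. All names (HeartbeatScheduler, try_start_heartbeat, finish_heartbeat) are illustrative, not MongoDB's actual API.

```python
# Hypothetical per-target de-dupe guard (illustrative names, not mongod's API).
import threading

class HeartbeatScheduler:
    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = set()  # targets with an active heartbeat thread

    def try_start_heartbeat(self, target):
        """Return True and mark the target only if no heartbeat is in flight."""
        with self._lock:
            if target in self._in_flight:
                return False  # de-dupe: a thread is already working this target
            self._in_flight.add(target)
            return True

    def finish_heartbeat(self, target):
        # Called when the heartbeat thread exits (success, timeout, or error).
        with self._lock:
            self._in_flight.discard(target)

sched = HeartbeatScheduler()
print(sched.try_start_heartbeat("node3:27017"))  # True  (first thread starts)
print(sched.try_start_heartbeat("node3:27017"))  # False (deduped, no new thread)
sched.finish_heartbeat("node3:27017")
print(sched.try_start_heartbeat("node3:27017"))  # True  (slot freed, next retry allowed)
```

Under this scheme a partitioned target would pin exactly one retrying thread instead of one per interval, which would cap the contention on the shared mutex.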

            Assignee:
            Chris Kelly
            Reporter:
            Deep Vyas
            Votes:
            0
            Watchers:
            5
