  1. Core Server
  2. SERVER-10085

Heartbeats can time out due to high network latency while fetching oplog batches

    • Type: Bug
    • Resolution: Done
    • Priority: Critical - P2
    • Fix Version/s: 2.4.6, 2.5.1
    • Affects Version/s: None
    • Component/s: Replication
    • Labels:
    • Environment:
      buildbot: Linux 64-bit DEBUG, Linux 64-bit debug dur off
    • Fully Compatible
    • ALL

      MongoDB Status as of September 30th, 2013

      When a secondary requests the next batch of the oplog from the primary, it holds an internal lock while waiting for the data to come back over the network. This same lock is required to service heartbeat requests. High latency and other network issues between nodes can cause the next batch of oplog data to take some time to retrieve, resulting in heartbeat requests timing out.

      This issue can result in repeated and unnecessary replica set failover. It is present in versions of MongoDB prior to and including v2.4.5.

      The issue has been resolved by not holding the bgsync mutex while waiting for the network.

      Improving the latency and reliability of your network will help to alleviate symptoms.

      Production release v2.4.6 contains the fix for this issue, and production release v2.6.0 will contain the fix as well.

      Detailed description:

      bgsync::produce holds BackgroundSync::_mutex through the call to r.tailingQueryGTE, which fetches the next batch of data from the primary's oplog. If the primary takes a long time to respond, heartbeats may start timing out, because servicing a heartbeat also requires acquiring BackgroundSync::_mutex. The fix is to change bgsync::produce to call r.tailingQueryGTE outside of the _mutex lock.
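
      The locking pattern described above can be illustrated with a minimal Python sketch. This is not the server's C++ code: the class BackgroundSyncSketch, its methods, and the helper heartbeat_during_fetch are hypothetical names invented for illustration, and the slow network fetch is simulated with threading events. It only shows why a heartbeat that needs the same mutex times out when the fetch runs under the lock, and why it succeeds once the fetch is moved outside it.

```python
import threading


class BackgroundSyncSketch:
    """Hypothetical stand-in for the background sync state (illustration only)."""

    def __init__(self, fetch_batch):
        self._mutex = threading.Lock()
        self._fetch_batch = fetch_batch  # simulates the network call for the next oplog batch
        self._last_batch = None

    def produce_holding_lock(self):
        # Problematic pattern: the mutex is held for the whole network wait.
        with self._mutex:
            self._last_batch = self._fetch_batch()

    def produce_fixed(self):
        # Fixed pattern: wait on the network first, then lock only to publish the result.
        batch = self._fetch_batch()
        with self._mutex:
            self._last_batch = batch

    def heartbeat(self, timeout):
        # A heartbeat needs the same mutex; give up if it cannot be acquired in time.
        if not self._mutex.acquire(timeout=timeout):
            return False
        try:
            return True
        finally:
            self._mutex.release()


def heartbeat_during_fetch(produce_name, heartbeat_timeout=0.05):
    """Run one produce call with a stalled fetch and report whether a heartbeat got through."""
    fetch_started = threading.Event()
    release_fetch = threading.Event()

    def slow_fetch():
        fetch_started.set()   # signal that the "network" wait has begun
        release_fetch.wait()  # stall until the test lets the fetch complete
        return ["op1", "op2"]

    sync = BackgroundSyncSketch(slow_fetch)
    producer = threading.Thread(target=getattr(sync, produce_name))
    producer.start()
    fetch_started.wait()
    answered = sync.heartbeat(heartbeat_timeout)
    release_fetch.set()
    producer.join()
    return answered
```

      Calling heartbeat_during_fetch("produce_holding_lock") models the pre-fix behavior, and heartbeat_during_fetch("produce_fixed") models the behavior after the fetch is moved out from under the lock.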

      Initial description by Eric on June 28:

      Now that we've pretty much fixed all the failing unit tests, we can see that zBigMapReduce is failing on the Linux Debug builder.

      The problem appears to be that late in the m/r phase the network breaks such that the primary and the secondary can no longer see each other (send and recv time out), which causes the primary to step down. I have no idea what would be causing this.

      This failure has been visible since Linux 64-bit DEBUG Build #2260 on June 27, but was likely hidden by simpler bugs. The last green Linux 64-bit DEBUG build was #2200 on June 13 (SHA1 86e76e34e88c).

      It is also visible in Linux 64-bit debug dur off builds since #2440 on June 29. The last green build on this builder was #2438 (SHA1 babd275f8818).


            milkie@mongodb.com Eric Milkie
            matt.kangas Matt Kangas