Type: Bug
Resolution: Done
Priority: Critical - P2
Fix Version/s: 2.4.6, 2.5.1
Affects Version/s: None
Component/s: Replication
Labels: buildbot
Environment: buildbot: Linux 64-bit DEBUG, Linux 64-bit debug dur off
Backwards Compatibility: Fully Compatible
Operating System: ALL
Confidence Status: None
Work Order: 0
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

MongoDB Status as of September 30th, 2013

ISSUE SUMMARY
When a secondary requests the next batch of the oplog from the primary, it holds an internal lock while waiting for the data to come back over the network. This same lock is required to service heartbeat requests. High latency and other network issues between nodes can cause the next batch of oplog data to take some time to retrieve, resulting in heartbeat requests timing out.

USER IMPACT
This issue can result in repeated and unnecessary replica set failover. It is present in versions of MongoDB prior to and including v2.4.5.

SOLUTION
The issue has been resolved by not holding the bgsync mutex while waiting for the network.

WORKAROUNDS
Improving the latency and reliability of your network will help to alleviate symptoms.

PATCHES
Production release v2.4.6 contains the fix for this issue, and production release v2.6.0 will contain the fix as well.

Detailed description:

bgsync::produce holds BackgroundSync::_mutex across the call to r.tailingQueryGTE, which fetches the next batch of data from the primary's oplog. If the primary is slow to respond, heartbeats may start timing out, because servicing a heartbeat also requires acquiring BackgroundSync::_mutex. The fix is to change bgsync::produce so that it calls r.tailingQueryGTE without holding _mutex.
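
The locking pattern described above can be illustrated with a minimal, self-contained sketch. This is not the MongoDB source: SyncState, slowFetch, produceHoldingLock, produceReleasingLock, heartbeat, and heartbeatSucceedsDuring are hypothetical names, the network round trip is simulated with a sleep, and the timeouts are arbitrary. It only shows why holding a mutex across a slow fetch starves another caller that needs the same mutex, and how copying state under the lock and fetching outside it avoids that.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the state guarded by BackgroundSync::_mutex.
struct SyncState {
    std::timed_mutex mutex;             // plays the role of _mutex
    long lastFetched = 0;               // data protected by the mutex
    std::atomic<bool> fetching{false};  // set once the simulated fetch is about to start
};

// Simulated slow network round trip (stand-in for r.tailingQueryGTE).
long slowFetch(long after, std::chrono::milliseconds latency) {
    std::this_thread::sleep_for(latency);
    return after + 1;
}

// Problem shape: the mutex stays locked for the whole network wait.
void produceHoldingLock(SyncState& s, std::chrono::milliseconds latency) {
    std::lock_guard<std::timed_mutex> lk(s.mutex);
    s.fetching = true;
    s.lastFetched = slowFetch(s.lastFetched, latency);
}

// Fixed shape: read under the lock, fetch unlocked, publish under the lock.
void produceReleasingLock(SyncState& s, std::chrono::milliseconds latency) {
    long after;
    {
        std::lock_guard<std::timed_mutex> lk(s.mutex);
        after = s.lastFetched;
    }
    s.fetching = true;
    long next = slowFetch(after, latency);  // no lock held here
    {
        std::lock_guard<std::timed_mutex> lk(s.mutex);
        s.lastFetched = next;
    }
}

// A heartbeat needs the same mutex and gives up after `timeout`.
bool heartbeat(SyncState& s, std::chrono::milliseconds timeout) {
    std::unique_lock<std::timed_mutex> lk(s.mutex, timeout);
    return lk.owns_lock();
}

// Runs one producer with a 500 ms simulated fetch and reports whether a
// heartbeat with a 100 ms timeout got through while the fetch was in flight.
bool heartbeatSucceedsDuring(bool holdLock) {
    using std::chrono::milliseconds;
    SyncState s;
    std::thread producer([&s, holdLock] {
        if (holdLock) produceHoldingLock(s, milliseconds(500));
        else produceReleasingLock(s, milliseconds(500));
    });
    while (!s.fetching) std::this_thread::yield();  // wait until the fetch begins
    bool ok = heartbeat(s, milliseconds(100));
    producer.join();
    return ok;
}
```

Calling heartbeatSucceedsDuring(true) models the behaviour before the fix, where the heartbeat waits on the mutex until it times out; heartbeatSucceedsDuring(false) models the behaviour after it, where the mutex is free during the fetch.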

Initial description by Eric on June 28:

Now that we've pretty much fixed all the failing unit tests, we can see that zBigMapReduce is failing on the Linux Debug builder.
http://buildlogs.mongodb.org/Linux%2064-bit%20DEBUG/builds/2264/test/recent%20failures/zbigMapReduce.js

The problem appears to be that late in the m/r phase the network breaks such that the primary and the secondary can no longer see each other (send and recv time out), which causes the primary to step down. I have no idea what would be causing this.

This failure has been visible since Linux 64-bit DEBUG Build #2260 on June 27, but likely was hidden by simpler bugs. The last green Linux 64-bit DEBUG build was #2200 on June 13 (SHA1 86e76e34e88c).

http://buildbot.10gen.cc/builders/Linux%2064-bit%20DEBUG?numbuilds=100

It is also visible in Linux 64-bit debug dur off builds since #2440 on June 29. Last green build on this builder was #2438 (SHA1 babd275f8818)

http://buildbot.10gen.cc/builders/Linux%2064-bit%20debug%20dur%20off?numbuilds=50

Assignee: Eric Milkie
Reporter: Matt Kangas (Inactive)
Participants: auto, Eric Milkie, Matt Kangas
Votes: 0
Watchers: 5

Created: Jul 03 2013 04:20:26 AM UTC
Updated: Jul 11 2016 05:39:53 PM UTC
Resolved: Jul 09 2013 11:57:33 AM UTC
