bgsync::produce holds BackgroundSync::_mutex across the call to r.tailingQueryGTE, which fetches the next batch of data from the primary's oplog. If the primary takes a long time to respond, heartbeats may start timing out, because processing a heartbeat also requires acquiring BackgroundSync::_mutex. The fix is to change bgsync::produce to call r.tailingQueryGTE without holding _mutex.
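The shape of that fix can be sketched roughly as follows. This is a minimal illustration, not the actual MongoDB code: FakeReader, Producer, tryHeartbeat, and the whileWaiting hook are all hypothetical names invented for the example, and the hook only simulates "something else needs the mutex while the query is in flight". The idea is to copy the state the query needs under the lock, release the lock for the slow call, then re-acquire it to publish the result.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the oplog reader; not MongoDB's real class.
struct FakeReader {
    // Runs while the simulated network call is "in flight".
    std::function<void()> whileWaiting;

    std::vector<int> tailingQuery(int after) {
        if (whileWaiting) whileWaiting();
        return {after + 1, after + 2, after + 3};
    }
};

// Hypothetical producer showing the lock scoping only.
class Producer {
public:
    explicit Producer(FakeReader& reader) : _reader(reader) {}

    void produce() {
        int lastFetched;
        {
            // Hold the mutex only long enough to copy what the query needs.
            std::lock_guard<std::mutex> lk(_mutex);
            lastFetched = _lastFetched;
        }

        // Slow call made with no lock held, so heartbeats are not blocked.
        std::vector<int> batch = _reader.tailingQuery(lastFetched);

        // Re-acquire the mutex to publish the batch.
        std::lock_guard<std::mutex> lk(_mutex);
        if (_lastFetched != lastFetched) {
            return;  // State changed while unlocked; drop the stale batch.
        }
        for (int op : batch) {
            _buffer.push_back(op);
        }
        if (!batch.empty()) {
            _lastFetched = batch.back();
        }
    }

    // Simulated heartbeat path: succeeds only if the mutex is free right now.
    bool tryHeartbeat() {
        std::unique_lock<std::mutex> lk(_mutex, std::try_to_lock);
        if (!lk.owns_lock()) {
            return false;
        }
        ++_heartbeats;
        return true;
    }

    std::size_t buffered() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _buffer.size();
    }

    int heartbeats() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _heartbeats;
    }

private:
    FakeReader& _reader;
    std::mutex _mutex;
    std::vector<int> _buffer;
    int _lastFetched = 0;
    int _heartbeats = 0;
};
```

Because the lock is dropped during the query, any state read before the call may be stale afterwards; the sketch handles that by re-checking _lastFetched before publishing, and a real fix would need an equivalent re-validation step.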
Initial description by Eric on June 28:
Now that we've pretty much fixed all the failing unit tests, we can see that zBigMapReduce is failing on the Linux Debug builder.
The problem seems to be that late in the m/r phase the network breaks such that the primary and the secondary can no longer see each other (send and recv time out), which causes the primary to step down. I have no idea what would be causing this.
This failure has been visible since Linux 64-bit DEBUG Build #2260 on June 27, but before that it was likely hidden by simpler bugs. The last green Linux 64-bit DEBUG build was #2200 on June 13 (SHA1 86e76e34e88c).
It is also visible in Linux 64-bit debug dur off builds since #2440 on June 29. The last green build on this builder was #2438 (SHA1 babd275f8818).