ISSUE SUMMARY
During a compact operation, heartbeats to and from the node fail, so the node can appear to be offline to the rest of the replica set (other members log "... thinks that we are down..." messages). This happens because a global write lock is taken before checking whether the node is running a compact.
USER IMPACT
Running compact makes the node appear to be down to the other members, which can affect operations that require a majority of the replica set (majority write concern, elections, etc.). Running compact on several secondaries of one replica set at the same time can cost the set its majority, forcing the primary to step down and leaving the replica set read-only.
SOLUTION
To fix the issue, the global write lock is no longer acquired when the node is running compact or another maintenance task, so heartbeats continue to be answered while the maintenance is in progress.
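As a minimal illustration of the kind of change involved, the simplified Python sketch below (not the actual server code; GLOBAL_WRITE_LOCK, in_maintenance, and the handler names are hypothetical) checks a maintenance flag before blocking on the global lock:

    import threading

    GLOBAL_WRITE_LOCK = threading.Lock()   # stands in for the server's global write lock
    in_maintenance = threading.Event()     # set while compact or another maintenance task runs

    def handle_heartbeat_before_fix():
        # Buggy ordering: block on the global lock first. While compact holds
        # the lock, heartbeats queue up here and eventually time out, so the
        # other members conclude the node is down.
        with GLOBAL_WRITE_LOCK:
            return {"ok": 1, "state": "SECONDARY"}

    def handle_heartbeat_after_fix():
        # Fixed ordering: check for maintenance first and answer without the
        # lock, so heartbeats keep flowing while compact is in progress.
        if in_maintenance.is_set():
            return {"ok": 1, "state": "RECOVERING"}
        with GLOBAL_WRITE_LOCK:
            return {"ok": 1, "state": "SECONDARY"}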
WORKAROUNDS
Run compact on only one secondary at a time in each replica set, verifying that a majority of voting members stays available. The secondary being compacted can temporarily be replaced with an arbiter node to preserve high availability; a scripted version of the one-at-a-time approach is sketched below.
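The following PyMongo sketch illustrates the workaround; the host, database, and collection names are hypothetical, and the directConnection flag assumes a recent driver version. It compacts one secondary at a time and waits for that member to report a healthy SECONDARY state before moving on to the next:

    from time import sleep
    from pymongo import MongoClient

    # Hypothetical secondary hosts of one replica set; compact them one at a time.
    SECONDARIES = ["rs1-sec1.example.net", "rs1-sec2.example.net"]

    for host in SECONDARIES:
        # Connect directly to the member itself (compact is run per node).
        member = MongoClient(host, 27017, directConnection=True)

        # The compact command blocks until compaction finishes.
        member.mydb.command("compact", "mycollection")

        # Wait until this member reports itself as SECONDARY again before
        # compacting the next one, so the set never loses its majority.
        while True:
            status = member.admin.command("replSetGetStatus")
            me = next(m for m in status["members"] if m.get("self"))
            if me["stateStr"] == "SECONDARY":
                break
            sleep(5)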
AFFECTED VERSIONS
All production releases up to and including 2.4.9 are affected.
PATCHES
The fix is included in the 2.4.10 production release and in the 2.5.5 development release, which will become the 2.6.0 production release.