If a runner performing a chunk migration cleanup yields and the node steps down from primary during the yield, the runner resumes on the assumption that the node is still primary and incorrectly attempts to write to the oplog, which triggers a fatal assertion.
The only configurations affected by this issue are sharded clusters where shards are replica sets, the balancer is enabled, and chunk migrations have occurred.
Under the conditions described above, the cleanup operation fails with a fatal assertion, and the node that was primary shuts down.
MongoDB 2.6 production releases up to 2.6.3 are affected by this issue.
The fix is included in the 2.6.4 production release.
During cleanup, always check the replica set status after yielding and abort the cleanup operation if the node is no longer primary.
The removeRange helper used by migration cleanup does not re-check replica set state after yielding with a YIELD_AUTO cursor. If a stepdown occurs during the yield, logOp() will fail (correctly) with an fassert().
We need to either not yield or re-check replica set state before deleting the document.
Affects v2.4; does not affect v2.7 because of changes in yield behavior.
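The re-check described above can be illustrated with a small standalone sketch. This is not the server code: ReplState, RemoveResult, removeRangeSketch, and the yieldFn callback are hypothetical names made up for this example, the collection is modelled as a std::map, and the oplog as a vector of deleted keys. It only shows the ordering that matters: yield, re-check primary state, and only then delete and log.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the node's replica set state.
struct ReplState {
    bool primary = true;
};

// Hypothetical result type: how many documents were removed, and whether
// the cleanup stopped early because the node was no longer primary.
struct RemoveResult {
    long long deleted = 0;
    bool aborted = false;
};

// Simplified model of a range delete over keys in [minKey, maxKey).
// yieldFn simulates a cursor yield; in this sketch it may change repl.primary
// but is assumed not to modify docs, so the iterator stays valid.
RemoveResult removeRangeSketch(std::map<int, std::string>& docs,
                               int minKey,
                               int maxKey,
                               const ReplState& repl,
                               std::vector<int>& oplog,
                               const std::function<void()>& yieldFn) {
    RemoveResult result;
    auto it = docs.lower_bound(minKey);
    while (it != docs.end() && it->first < maxKey) {
        yieldFn();  // other operations, including a stepdown, may run here

        // Re-check replica set state after every yield. Without this check
        // the loop would go on to "log" a write on a non-primary node.
        if (!repl.primary) {
            result.aborted = true;
            break;
        }

        oplog.push_back(it->first);  // stands in for logging the delete
        it = docs.erase(it);         // erase returns the next valid iterator
        ++result.deleted;
    }
    return result;
}
```

If the simulated stepdown happens during the third yield, the sketch removes and logs two documents, then aborts without touching the rest of the range. The alternative named in the description, not yielding at all, would avoid the check but would hold the lock for the whole range.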
- related to
SERVER-15798 Helpers::removeRange does not check if node is primary
SERVER-14261 stepdown during migration range delete can abort mongod
SERVER-16115 Helpers::removeRange should check if master