This is our setup:
2 shards, each a ReplicaSet with 4 nodes. One of the nodes is dedicated to backups (priority:0, hidden:true).
When we start a backup, we send the fsyncLock command to the backup node and then start an rsync of the filesystem.
After the backup has finished, we send the fsyncUnlock command to the backup node.
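For reference, the lock / copy / unlock sequence described above could be sketched roughly like this in Python. This is only an illustration, not our actual backup script: the function names run_backup and rsync_copy, the paths and the rsync flags are made up for the example. It assumes admin_db is something like pymongo's MongoClient(...).admin and that the server accepts the fsyncUnlock command (older servers unlock differently).

```python
import subprocess


def run_backup(admin_db, copy_files):
    """Lock the backup node, copy its files, and always unlock afterwards.

    admin_db is assumed to expose a pymongo-style .command() method.
    copy_files is any callable that performs the filesystem copy.
    """
    # Flush pending writes to disk and block further writes on this node.
    admin_db.command("fsync", lock=True)
    try:
        copy_files()
    finally:
        # Release the lock even if the copy fails, so the node can catch up.
        admin_db.command("fsyncUnlock")


def rsync_copy(src, dest):
    """Hypothetical copy step: mirror the data directory with rsync."""
    subprocess.run(["rsync", "-a", "--delete", src, dest], check=True)
```

It would be called as run_backup(admin_db, lambda: rsync_copy("/data/db/", "/backup/db/")), where both paths are placeholders. The try/finally is there so that a failed rsync does not leave the backup node locked.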
After a master switch in the ReplicaSet (due to an upgrade or a failure), we run into the following problem: some or all slaves stop syncing the oplog when the backup node starts its backup. They stop at exactly the moment we issue the fsyncLock command; we can tell because the replication lag of the affected slaves is the same as that of the backup node. When the backup is finished, the other slaves start syncing again as well.
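To make the "same replication lag" observation concrete, here is a small sketch of how the lag per member can be derived from the output of replSetGetStatus. It is an assumption about how one might measure it, not necessarily how our monitoring does it. The function name is made up; the fields members, name, stateStr and optimeDate are the ones replSetGetStatus returns.

```python
def replication_lag_seconds(status):
    """Return {member name: lag in seconds} for every secondary.

    status is the document returned by the replSetGetStatus command.
    Lag is measured as the primary's optimeDate minus the member's.
    """
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"  # the hidden backup node is included
    }
```

With pymongo, the status document could be fetched with admin_db.command("replSetGetStatus"). If an ordinary slave reports the same growing lag as the locked backup node, it stopped applying the oplog at the same time.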
db.currentOp() doesn't show the fsyncLock on the slaves, only on the backup node.
To get rid of this problem we have to restart the non-backup slave. After this restart the slave runs fine and never again stops syncing together with the backup node.
This is the second time we've encountered this problem. Since this is our production environment, we don't want to force a master switch unless it is needed.
It seems that the cause of this problem is the master switch in the ReplicaSet.