Monitoring software can currently detect a stuck replica only by comparing oplog timestamps between members.
- Adding a "replication lag" column to the table at the top of the /_replSet page might help.
- The "skew" column on the /_replSet page is particularly ambiguous: I was convinced it was the replication lag.
- In the recent issue we had, the replica sync thread was clearly stuck: it was looping, trying to replicate an impossible event. I guess that kind of condition could be detected and reported as part of rs.status(). As far as I know, the semantics of "errMsg" are not completely clear (it is not reset by a full resync, for instance), so it is not easy to monitor.
- What about something like an "acceptable-replica-lag" option on the node? If the replica fell behind by more than a given number of seconds, it would switch to the "RECOVERING" mode. That would also prevent it from serving stale content.
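
For reference, the oplog timestamp comparison that monitoring has to do today can be sketched roughly as below. This is only an illustration, not how the server does or should do it. It assumes a dict shaped like the output of rs.status(), where each entry in "members" has "name", "stateStr" and "optimeDate". The helper names (replication_lag, lagging_members), the host names and the 60 second threshold are made up for the example.

```python
from datetime import datetime


def replication_lag(status):
    """Return {member name: lag in seconds} for each SECONDARY,
    measured against the PRIMARY's optimeDate."""
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        raise ValueError("no PRIMARY in replica set status")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def lagging_members(status, max_lag_seconds):
    """Names of secondaries whose lag exceeds the (hypothetical) threshold."""
    lags = replication_lag(status)
    return sorted(name for name, lag in lags.items() if lag > max_lag_seconds)


# Hand-written sample in the shape of rs.status(); not real output.
sample_status = {
    "set": "rs0",
    "members": [
        {"name": "db1:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2012, 1, 1, 12, 0, 0)},
        {"name": "db2:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2012, 1, 1, 11, 59, 55)},
        {"name": "db3:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2012, 1, 1, 11, 50, 0)},
    ],
}

if __name__ == "__main__":
    print(replication_lag(sample_status))
    print(lagging_members(sample_status, max_lag_seconds=60))
```

Note that a check like this cannot tell a stuck sync thread from a replica that is merely slow, which is why having the condition reported directly in rs.status() would be more useful.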