Core Server / SERVER-3575

Stuck replication condition should be made very visible

      A stuck replica is currently visible to monitoring software only by comparing oplog timestamps between members.

      • Adding a "replication lag" column to the table at the top of /_replSet might help.
      • The "skew" column in /_replSet is particularly ambiguous: I was convinced it showed the replication lag.
      • In the recent issue we had, the replica sync thread was clearly stuck: it was looping, trying to replicate an impossible event. I guess that kind of condition could be detected and reported as part of rs.status(). The semantics of "errMsg" are not completely clear, as far as I know (it is not reset by a full resync, for instance), so it is not easy to monitor.
      • What about something like an "acceptable-replica-lag" option on the node? If the replica falls behind by more than a given number of seconds, it would switch to the "RECOVERING" state. That would also prevent it from serving stale content.
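
To make the monitoring side concrete, here is a rough sketch of the check an external script has to do today. It assumes the status document has the shape returned by rs.status() (a "members" list whose entries carry "name", "stateStr" and "optimeDate"); the function names, the threshold constant and the sample data are made up for illustration and are not part of the server.

```python
from datetime import datetime, timedelta

# Hypothetical threshold, standing in for the proposed "acceptable-replica-lag" option.
ACCEPTABLE_LAG_SECONDS = 30


def replication_lag(status):
    """Return {secondary name: seconds behind the primary's last applied op}."""
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        raise ValueError("no primary in status document")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def stuck_members(previous, current):
    """Secondaries whose optime did not advance between two samples
    although the primary's optime did."""
    def optimes(status):
        return {m["name"]: (m["stateStr"], m["optimeDate"]) for m in status["members"]}

    before, after = optimes(previous), optimes(current)
    primary_moved = any(
        state == "PRIMARY" and name in before and before[name][1] < optime
        for name, (state, optime) in after.items()
    )
    if not primary_moved:
        # No new writes on the primary: an idle secondary is not stuck.
        return []
    return sorted(
        name
        for name, (state, optime) in after.items()
        if state == "SECONDARY" and name in before and before[name][1] >= optime
    )


# Made-up samples taken 60 seconds apart; db2 stays at the same optime.
t0 = datetime(2020, 1, 1, 12, 0, 0)
sample_1 = {"members": [
    {"name": "db1:27017", "stateStr": "PRIMARY", "optimeDate": t0},
    {"name": "db2:27017", "stateStr": "SECONDARY", "optimeDate": t0 - timedelta(seconds=5)},
]}
sample_2 = {"members": [
    {"name": "db1:27017", "stateStr": "PRIMARY", "optimeDate": t0 + timedelta(seconds=60)},
    {"name": "db2:27017", "stateStr": "SECONDARY", "optimeDate": t0 - timedelta(seconds=5)},
]}

lag = replication_lag(sample_2)
too_far_behind = [name for name, seconds in lag.items() if seconds > ACCEPTABLE_LAG_SECONDS]
print(lag, too_far_behind, stuck_members(sample_1, sample_2))
```

With a driver such as pymongo, the status document could be fetched with client.admin.command("replSetGetStatus"). Having the server report the lag and the stuck state itself would make this kind of two-sample comparison unnecessary.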

            kristina Kristina Chodorow (Inactive)
            kali Mathieu Poumeyrol