[SERVER-3575] Stuck replication condition should be made very visible Created: 11/Aug/11  Updated: 11/Jul/16  Resolved: 22/Feb/12

Status: Closed
Project: Core Server
Component/s: Admin, Replication, Stability
Affects Version/s: None
Fix Version/s: 2.1.1

Type: New Feature
Priority: Major - P3
Reporter: Mathieu Poumeyrol
Assignee: Kristina Chodorow (Inactive)
Resolution: Done
Votes: 2
Labels: replication, rn
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

 Description   

A stuck replica is currently visible to monitoring software only by comparing oplog timestamps between members.

  • Adding a "replication lag" column to the table at the top of /_replSet might help.
  • The "skew" column on the /_replSet page is particularly ambiguous: I was convinced it showed the replication lag.
  • In the recent issue we had, the replica sync thread was clearly stuck. It was looping, trying to replicate an impossible event. I guess that kind of condition could be detected and reported as part of rs.status(). The semantics of "errMsg" are not completely clear, as far as I know (it is not reset by a full resync, for instance), so it is not easy to monitor.
  • What about something like an "acceptable-replica-lag" option on the node? If the replica fell more than a given number of seconds behind, it would switch to "RECOVERING" mode. That would also prevent it from serving stale content.
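
For reference, the oplog timestamp comparison that monitoring has to do today could be sketched roughly as follows. This is only an illustration, assuming a status document shaped like the rs.status() output, with a "members" list whose entries carry "name", "stateStr" and "optimeDate". The function names and the threshold are hypothetical and not part of the server.

```python
from datetime import datetime, timedelta


def replication_lag(status):
    """Return {member name: seconds behind the primary} for each secondary.

    `status` is assumed to look like rs.status() output: a dict with a
    "members" list of dicts holding "name", "stateStr" and "optimeDate".
    """
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        raise ValueError("no primary in status document")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def lagging_members(status, max_lag_seconds):
    """Names of secondaries whose lag exceeds a (hypothetical) threshold."""
    lags = replication_lag(status)
    return sorted(name for name, lag in lags.items() if lag > max_lag_seconds)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    now = datetime(2012, 2, 22, 12, 0, 0)
    sample = {
        "members": [
            {"name": "a:27017", "stateStr": "PRIMARY", "optimeDate": now},
            {"name": "b:27017", "stateStr": "SECONDARY",
             "optimeDate": now - timedelta(seconds=900)},
        ]
    }
    print(replication_lag(sample))
    print(lagging_members(sample, max_lag_seconds=60))
```

With PyMongo, the status document could presumably be fetched with client.admin.command("replSetGetStatus") and passed in unchanged. The threshold check is the monitoring-side equivalent of the "acceptable-replica-lag" idea: it only reports lagging members and does not change their state.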


 Comments   
Comment by Mathieu Poumeyrol [ 22/Feb/12 ]

Looks good. Thanks.

Comment by auto [ 22/Feb/12 ]

Author: Kristina (kchodorow) <kristina@10gen.com>

Message: Clock skew clarifiaction SERVER-3575
Branch: master
https://github.com/mongodb/mongo/commit/1e51e26a0de6a406a60244197cd0c2d078347a1b

Comment by auto [ 22/Feb/12 ]

Author: Kristina (kchodorow) <kristina@10gen.com>

Message: Add lag to _replSet page SERVER-3575
Branch: master
https://github.com/mongodb/mongo/commit/2eec128d9efac6fa383fbc466ed65c2259a4dd74

Comment by Kristina Chodorow (Inactive) [ 21/Nov/11 ]

Adding lag to /_replSet is very doable.

> In the recent issue we had, the replica sync thread was clearly stuck. It was looping
> trying to replicate an impossible event. I guess that kind of condition could be detected
> and reported as part of the rs.status(). The "errMsg" semantic is not completely clear as
> far as I know (it is not reset by a full resync, for instance) so it is not easy to
> monitor.

There should be an error message about it in rs.status(). I'll look into making it clearer.
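
Until the error message is clearer, a monitoring script could approximate "stuck" detection by comparing two status samples taken some time apart. The following is a rough sketch only, assuming each sample has the rs.status() shape ("members" entries with "name", "stateStr" and "optimeDate"). The function name and the heuristic are hypothetical and are not how the server itself detects the condition.

```python
from datetime import datetime, timedelta


def stuck_members(earlier, later):
    """Names of secondaries whose optime did not advance between two
    samples even though the primary's optime did.

    Both arguments are assumed to be rs.status()-shaped dicts.
    """
    def optimes(status):
        return {m["name"]: (m["stateStr"], m["optimeDate"])
                for m in status["members"]}

    before, after = optimes(earlier), optimes(later)

    # If the primary applied no new writes, an unchanged secondary is
    # merely idle, so nothing is reported.
    primary_advanced = any(
        state == "PRIMARY" and name in before and date > before[name][1]
        for name, (state, date) in after.items()
    )
    if not primary_advanced:
        return []

    return sorted(
        name
        for name, (state, date) in after.items()
        if state == "SECONDARY" and name in before and date <= before[name][1]
    )
```

A single pair of samples can produce false positives (for example, a secondary that is briefly busy), so a real check would probably require the condition to hold over several consecutive samples before alerting.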

An acceptable-replica-lag option is a bigger request. I'd suggest filing a separate feature request ticket for it if it's something you'd like to see.

Generated at Thu Feb 08 03:03:26 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.