[SERVER-24360] opdate higher in secondary Created: 02/Jun/16  Updated: 14/Jul/16  Resolved: 20/Jun/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.6.8
Fix Version/s: None

Type: Question Priority: Trivial - P5
Reporter: adrian Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

I am using this check https://github.com/mzupan/nagios-plugin-mongodb/blob/master/check_mongodb.py to monitor replication lag between nodes in a replica set.

From time to time I have observed big lags between nodes (> 1000s).
I was guessing it could be a bug of the check, so I executed a second script obtaining info from the replica set when the check fails, and it shows this:

MongoDB shell version: 2.6.8
connecting to: test
{
        "set" : "cyclops",
        "date" : ISODate("2016-06-02T09:12:38Z"),
        "myState" : 2,
        "syncingTo" : "NODE11:27017",
        "members" : [
                {
                        "_id" : 0,
                        "name" : "NODE10:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 26251817,
                        "optime" : Timestamp(1464858757, 28),
                        "optimeDate" : ISODate("2016-06-02T09:12:37Z"),
                        "self" : true
                },
                {
                        "_id" : 1,
                        "name" : "NODE11:27017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 26096062,
                        "optime" : Timestamp(1464858306, 3),
                        "optimeDate" : ISODate("2016-06-02T09:05:06Z"),
                        "lastHeartbeat" : ISODate("2016-06-02T09:12:36Z"),
                        "lastHeartbeatRecv" : ISODate("2016-06-02T09:12:37Z"),
                        "pingMs" : 0,
                        "electionTime" : Timestamp(1438762714, 1),
                        "electionDate" : ISODate("2015-08-05T08:18:34Z")
                },
                {
                        "_id" : 2,
                        "name" : "NODE12:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 26251812,
                        "optime" : Timestamp(1464858306, 3),
                        "optimeDate" : ISODate("2016-06-02T09:05:06Z"),
                        "lastHeartbeat" : ISODate("2016-06-02T09:12:36Z"),
                        "lastHeartbeatRecv" : ISODate("2016-06-02T09:12:37Z"),
                        "pingMs" : 1,
                        "syncingTo" : "NODE011:27017"
                }
        ],
        "ok" : 1
}

The node executing the check:
"optimeDate" : ISODate("2016-06-02T09:12:37Z"),

Primary node:
"optimeDate" : ISODate("2016-06-02T09:05:06Z"),

So lag is -451s!

Both nodes are ntp synced.



 Comments   
Comment by Ramon Fernandez Marina [ 20/Jun/16 ]

adrianlzt, please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag, where your question will reach a larger audience. Questions involving more discussion would be best posted on the mongodb-user group. See also our Technical Support page for additional support resources.

Regards,
Ramón.

Comment by adrian [ 02/Jun/16 ]

But the info about primary is:
"optimeDate" : ISODate("2016-06-02T09:05:06Z"),
"lastHeartbeat" : ISODate("2016-06-02T09:12:36Z"),
"lastHeartbeatRecv" : ISODate("2016-06-02T09:12:37Z"),

With this data I understant that secondary have received info about primary at 09:12:37Z, but optime wasn't updated.

Anyways, I have improved the script to collect data so it ssh in every host and print the state.

Thanks for the clarifitacion about the view of rs.status

Comment by Eric Milkie [ 02/Jun/16 ]

The output of the replSetGetStatus command is the data about the set from the point of view of the node where you are running the command; it is not an atomic point in time across all nodes. There is, in fact, no way to atomically query all nodes in a replica set at an exact moment in time, so there isn't a way to calculate lag exactly at any moment in time.
In the example you show, it looks like the node you queried hadn't updated its info from the primary in a while. You can improve your script get a better picture if you run the command on every node in turn and only use the data for the member where the "self" field is true.

Generated at Thu Feb 08 04:06:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.