[SERVER-19823] rs.printSlaveReplicationInfo() syncedTo field displays the epoch for unreachable secondaries Created: 07/Aug/15  Updated: 11/Sep/20  Resolved: 09/Sep/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.8.0

Type: Improvement Priority: Minor - P4
Reporter: Ramon Fernandez Marina Assignee: Huayu Ouyang
Resolution: Done Votes: 1
Labels: former-quick-wins, gm-ack, neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Backwards Compatibility: Fully Compatible
Sprint: Repl 2020-09-07, Repl 2020-09-21
Participants:
Case:

 Description   

A secondary in a 3-node replica set was terminated; the primary kept taking writes, and rs.printSlaveReplicationInfo() would show something like this in 2.6.10:

replset:PRIMARY> rs.printSlaveReplicationInfo()
source: skye.local:27018
        syncedTo: Fri Aug 07 2015 11:38:46 GMT-0400 (EDT)
        0 secs (0 hrs) behind the primary 
source: skye.local:27019
        syncedTo: Fri Aug 07 2015 11:37:17 GMT-0400 (EDT)
        89 secs (0.02 hrs) behind the primary

The secondary in 27018 is caught up, and the one in 27019 is behind the primary as expected, since it was terminated a minute and a half ago.

In 3.0.5 and 3.1.6 however:

replset:PRIMARY> rs.printSlaveReplicationInfo()
source: skye.local:27018
    syncedTo: Fri Aug 07 2015 11:38:46 GMT-0400 (EDT)
    102 secs (0.03 hrs) behind the primary 
source: skye.local:27019
    syncedTo: Wed Dec 31 1969 19:00:00 GMT-0500 (EST)
    1438962028 secs (399711.67 hrs) behind the primary 

Note how the syncedTo date goes back to the epoch.



 Comments   
Comment by Githook User [ 11/Sep/20 ]

Author:

{'name': 'Huayu Ouyang', 'email': 'huayu.ouyang@mongodb.com'}

Message: SERVER-19823 rs.printSlaveReplicationInfo() syncedTo field displays the epoch for unreachable secondaries
Branch: svilen-optimizer-poc
https://github.com/mongodb/mongo/commit/1327ea636630ff2f98400a8de7d95ff0f1b8b344

Comment by Githook User [ 09/Sep/20 ]

Author:

{'name': 'Huayu Ouyang', 'email': 'huayu.ouyang@mongodb.com'}

Message: SERVER-19823 rs.printSlaveReplicationInfo() syncedTo field displays the epoch for unreachable secondaries
Branch: master
https://github.com/mongodb/mongo/commit/1327ea636630ff2f98400a8de7d95ff0f1b8b344

Comment by Eric Milkie [ 31/Aug/15 ]

You might want to know what the lag is for a node that is in state RECOVERING as well, even though you can't currently read from it.

Comment by Kevin Pulo [ 30/Aug/15 ]

Sure. Though I'm thinking wouldn't the replset state be a better judge? If something is claiming to be in SECONDARY, but reporting an optime of 0, I'd really like to know about it. In which case, the question is: for which states is the optime meaningless/ignorable? Anything that can't accept reads, correct? Is that all of them except PRIMARY and SECONDARY (for varying reasons)?

Comment by Eric Milkie [ 28/Aug/15 ]

I think we can just fix "r(x)" to check for a non-0,0 Timestamp in the optime field, rather than just the presence of the optime field at all.

Comment by Kevin Pulo [ 28/Aug/15 ]

The issue here is that in 3.0/3.1, replSetGetStatus returns an optime of 0 (and so optimeDate of epoch) for unreachable members. Whereas in 2.6 it's the last optime that member was seen to have. The relevant lines are highlighted below. The same is true whether replSetGetStatus is run on a primary or secondary.

So there doesn't seem to be any quick js-based fix. I would say this behaviour is accurate — the host is unreachable, so there's no way to know how far it is right now. In which case, the output of printSlaveReplicationInfo should probably be changed to reflect that the host is uncontactable, rather than obtusely claiming it's "infinitely behind" the primary.

2.6.10:

> db.adminCommand("replSetGetStatus")
{
        "set" : "replset",
        "date" : ISODate("2015-08-28T08:39:03Z"),
        "myState" : 1,
        "members" : [
                {
                        "_id" : 0,
                        "name" : "genique:19823",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 56,
                        "optime" : Timestamp(1440751140, 1),
                        "optimeDate" : ISODate("2015-08-28T08:39:00Z"),
                        "lastHeartbeat" : ISODate("2015-08-28T08:39:01Z"),
                        "lastHeartbeatRecv" : ISODate("2015-08-28T08:39:01Z"),
                        "pingMs" : 0,
                        "lastHeartbeatMessage" : "syncing to: genique:19824",
                        "syncingTo" : "genique:19824"
                },
                {
                        "_id" : 1,
                        "name" : "genique:19824",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 61,
                        "optime" : Timestamp(1440751142, 1),
                        "optimeDate" : ISODate("2015-08-28T08:39:02Z"),
                        "electionTime" : Timestamp(1440751094, 1),
                        "electionDate" : ISODate("2015-08-28T08:38:14Z"),
                        "self" : true
                },
                {
                        "_id" : 2,
                        "name" : "genique:19825",
                        "health" : 0,
                        "state" : 8,
                        "stateStr" : "(not reachable/healthy)",
                        "uptime" : 0,
                        "optime" : Timestamp(1440751114, 1),
                        "optimeDate" : ISODate("2015-08-28T08:38:34Z"),
                        "lastHeartbeat" : ISODate("2015-08-28T08:38:59Z"),
                        "lastHeartbeatRecv" : ISODate("2015-08-28T08:38:35Z"),
                        "pingMs" : 0,
                        "syncingTo" : "genique:19824"
                }
        ],
        "ok" : 1
}

3.0/3.1:

> db.adminCommand("replSetGetStatus")
{
        "set" : "replset",
        "date" : ISODate("2015-08-28T08:37:37.268Z"),
        "myState" : 1,
        "members" : [
                {
                        "_id" : 0,
                        "name" : "genique:19823",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 40,
                        "optime" : Timestamp(1440751055, 1),
                        "optimeDate" : ISODate("2015-08-28T08:37:35Z"),
                        "lastHeartbeat" : ISODate("2015-08-28T08:37:36.576Z"),
                        "lastHeartbeatRecv" : ISODate("2015-08-28T08:37:37.009Z"),
                        "pingMs" : 0,
                        "syncingTo" : "genique:19824",
                        "configVersion" : 1
                },
                {
                        "_id" : 1,
                        "name" : "genique:19824",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 43,
                        "optime" : Timestamp(1440751056, 1),
                        "optimeDate" : ISODate("2015-08-28T08:37:36Z"),
                        "electionTime" : Timestamp(1440751016, 1),
                        "electionDate" : ISODate("2015-08-28T08:36:56Z"),
                        "configVersion" : 1,
                        "self" : true
                },
                {
                        "_id" : 2,
                        "name" : "genique:19825",
                        "health" : 0,
                        "state" : 8,
                        "stateStr" : "(not reachable/healthy)",
                        "uptime" : 0,
                        "optime" : Timestamp(0, 0),
                        "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                        "lastHeartbeat" : ISODate("2015-08-28T08:37:36.577Z"),
                        "lastHeartbeatRecv" : ISODate("2015-08-28T08:37:32.789Z"),
                        "pingMs" : 0,
                        "lastHeartbeatMessage" : "Failed attempt to connect to genique:19825; couldn't connect to server genique:19825 (127.0.1.1), connection attempt failed",
                        "configVersion" : -1
                }
        ],
        "ok" : 1
}

Comment by Eric Milkie [ 07/Aug/15 ]

kevin.pulo@10gen.com can you see if it was a format change in replSetGetStatus command, or if the js function itself changed? Perhaps there is something simple we can change in the javascript to improve this.

Generated at Thu Feb 08 03:52:11 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.