[SERVER-19823] rs.printSlaveReplicationInfo() syncedTo field displays the epoch for unreachable secondaries Created: 07/Aug/15 Updated: 11/Sep/20 Resolved: 09/Sep/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.8.0 |
| Type: | Improvement | Priority: | Minor - P4 |
| Reporter: | Ramon Fernandez Marina | Assignee: | Huayu Ouyang |
| Resolution: | Done | Votes: | 1 |
| Labels: | former-quick-wins, gm-ack, neweng |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Repl 2020-09-07, Repl 2020-09-21 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
A secondary in a 3-node replica set was terminated while the primary kept taking writes. In 2.6.10, rs.printSlaveReplicationInfo() would show something like this:
The secondary on port 27018 is caught up, and the one on port 27019 is behind the primary, as expected, since it was terminated a minute and a half earlier. In 3.0.5 and 3.1.6, however:
Note how the syncedTo date for the terminated secondary goes back to the epoch. |
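To illustrate why an epoch syncedTo value is misleading, here is a minimal sketch of the lag arithmetic, assuming the reported lag is simply the primary's optimeDate minus the member's optimeDate. The helper name `lagSeconds` and all dates are hypothetical, made up for illustration; this is not the shell helper's actual source.

```javascript
// Hypothetical helper: how many seconds a member's optimeDate trails the primary's.
function lagSeconds(primaryOptimeDate, memberOptimeDate) {
    return Math.round((primaryOptimeDate.getTime() - memberOptimeDate.getTime()) / 1000);
}

// Made-up values for illustration only.
const primaryDate = new Date("2015-08-07T12:00:00Z");
const stoppedSecondaryDate = new Date("2015-08-07T11:58:30Z"); // last optime actually seen
const epochDate = new Date(0);                                  // optimeDate derived from an optime of 0

// With the last known optime, the lag is the expected minute and a half.
console.log(lagSeconds(primaryDate, stoppedSecondaryDate)); // 90

// With an epoch optimeDate, the same arithmetic yields a lag of several decades.
const secondsPerYear = 365.25 * 24 * 3600;
console.log(Math.floor(lagSeconds(primaryDate, epochDate) / secondsPerYear) + " years");
```

Under this assumption, the epoch date does not describe how stale the member is; it only signals that no optime is known for it.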
| Comments |
| Comment by Githook User [ 11/Sep/20 ] |
|
Author: {'name': 'Huayu Ouyang', 'email': 'huayu.ouyang@mongodb.com'} Message: |
| Comment by Githook User [ 09/Sep/20 ] |
|
Author: {'name': 'Huayu Ouyang', 'email': 'huayu.ouyang@mongodb.com'} Message: |
| Comment by Eric Milkie [ 31/Aug/15 ] |
|
You might want to know what the lag is for a node that is in state RECOVERING as well, even though you can't currently read from it. |
| Comment by Kevin Pulo [ 30/Aug/15 ] |
|
Sure. Though I'm thinking: wouldn't the replset state be a better judge? If something is claiming to be in SECONDARY, but reporting an optime of 0, I'd really like to know about it. In which case, the question is: for which states is the optime meaningless/ignorable? Anything that can't accept reads, correct? Is that all of them except PRIMARY and SECONDARY (for varying reasons)? |
| Comment by Eric Milkie [ 28/Aug/15 ] |
|
I think we can just fix "r(x)" to check for a non-0,0 Timestamp in the optime field, rather than just the presence of the optime field at all. |
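The check suggested above could look roughly like the following sketch. It is not the shell's actual "r(x)" code: the helper names are hypothetical, and a Timestamp is modelled as a plain `{t, i}` object so the example runs outside the mongo shell. The member documents are made up for illustration.

```javascript
// Assumption: a Timestamp is modelled as a plain object {t: seconds, i: increment}.
function isZeroTimestamp(ts) {
    return ts.t === 0 && ts.i === 0;
}

// Hypothetical helper: true only if the member reports a real (non-0,0) optime,
// rather than merely having an optime field present.
function hasUsableOptime(member) {
    if (member.optime === undefined || member.optime === null) {
        return false;
    }
    // Handled defensively: the Timestamp may be nested under optime.ts.
    const ts = member.optime.ts !== undefined ? member.optime.ts : member.optime;
    return !isZeroTimestamp(ts);
}

// Made-up member documents for illustration only.
const caughtUp = { name: "localhost:27018", optime: { t: 1438948800, i: 1 } };
const unreachable = { name: "localhost:27019", optime: { t: 0, i: 0 } };

console.log(hasUsableOptime(caughtUp));    // true
console.log(hasUsableOptime(unreachable)); // false
```

A presence-only test would treat both members the same way, which is how a 0,0 optime ends up being printed as an epoch syncedTo date.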
| Comment by Kevin Pulo [ 28/Aug/15 ] |
|
The issue here is that in 3.0/3.1, replSetGetStatus returns an optime of 0 (and so an optimeDate of the epoch) for unreachable members, whereas in 2.6 it returns the last optime that member was seen to have. The relevant lines are highlighted below. The same is true whether replSetGetStatus is run on a primary or a secondary, so there doesn't seem to be any quick js-based fix.
I would say this behaviour is accurate: the host is unreachable, so there's no way to know how far behind it is right now. In which case, the output of printSlaveReplicationInfo should probably be changed to reflect that the host is uncontactable, rather than obtusely claiming it's "infinitely behind" the primary.
2.6.10:
3.0/3.1:
|
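One way to report an uncontactable host instead of an epoch-based lag is sketched below. It assumes member documents shaped like replSetGetStatus entries (`name`, `health`, `stateStr`, `optimeDate`); the function name `describeMember`, the output wording, and the sample documents are hypothetical and are not the change that was eventually committed.

```javascript
// Hypothetical formatter: returns the text to print for one replica set member.
function describeMember(member, primaryOptimeDate) {
    const lines = ["source: " + member.name];

    // Treat the optime as unknown if the member is unhealthy, or if its
    // optimeDate is missing or equal to the epoch (an optime of 0).
    const optimeUnknown = member.health === 0 ||
        !member.optimeDate ||
        member.optimeDate.getTime() === 0;

    if (optimeUnknown) {
        lines.push("\tno replication info available. State: " + member.stateStr);
    } else {
        const lag = Math.round((primaryOptimeDate.getTime() - member.optimeDate.getTime()) / 1000);
        lines.push("\tsyncedTo: " + member.optimeDate.toISOString());
        lines.push("\t" + lag + " secs behind the primary");
    }
    return lines.join("\n");
}

// Made-up status documents for illustration only.
const primaryOptimeDate = new Date("2015-08-07T12:00:00Z");
const members = [
    { name: "localhost:27018", health: 1, stateStr: "SECONDARY", optimeDate: new Date("2015-08-07T12:00:00Z") },
    { name: "localhost:27019", health: 0, stateStr: "(not reachable/healthy)", optimeDate: new Date(0) }
];

members.forEach(function (m) {
    console.log(describeMember(m, primaryOptimeDate));
});
```

Printing the member's state in place of a lag figure keeps the output honest about what is actually known for a host that cannot be contacted.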
| Comment by Eric Milkie [ 07/Aug/15 ] |
|
kevin.pulo@10gen.com, can you see whether this was a format change in the replSetGetStatus command, or whether the js function itself changed? Perhaps there is something simple we can change in the javascript to improve this. |