[DOCS-11543] A new explanation of "uptime" in replSetGetStatus Created: 04/Apr/18  Updated: 30/Oct/23

Status: Closed
Project: Documentation
Component/s: manual, Server
Affects Version/s: None
Fix Version/s: Server_Docs_20231030

Type: Improvement Priority: Major - P3
Reporter: Akira Kurogane Assignee: Ravind Kumar (Inactive)
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:
Days since reply: 1 year, 14 weeks, 2 days ago
Epic Link: DOCSP-1769

 Description   

I recently discovered I have had the wrong impression about the uptime values in rs.status() (i.e. replSetGetStatus) output for a very long time (~3 years).

I thought the uptime was that of the Unix/Windows process, but this is only true for the "self": true node, i.e. the node you execute rs.status() on.

For the other nodes it is the span of time since the first heartbeat was returned from them. So if you restart a node and then run rs.status() on it, all the uptimes it reports will have restarted from zero. If you run rs.status() on any other node instead, the other members show higher uptimes and only the restarted node has the small uptime.
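To make the distinction concrete, here is a minimal sketch that separates the two kinds of uptime in a replSetGetStatus-style document. The sample document and its host names are invented for illustration, and split_uptimes is a hypothetical helper, not part of any driver. It relies only on the members, name, self and uptime fields.

```python
def split_uptimes(status):
    """Separate the reporting member's own uptime from the uptimes it reports for peers."""
    self_uptime = None
    peer_uptimes = {}
    for member in status["members"]:
        if member.get("self", False):
            # Entry for the node the command was run on: process uptime.
            self_uptime = member["uptime"]
        else:
            # Entry for another node: time since heartbeats were established.
            peer_uptimes[member["name"]] = member["uptime"]
    return self_uptime, peer_uptimes


# Invented sample, shaped like the output seen on a recently restarted node.
sample_status = {
    "set": "rs0",
    "members": [
        {"name": "host-a:27017", "self": True, "uptime": 120},
        {"name": "host-b:27017", "uptime": 115},
        {"name": "host-c:27017", "uptime": 110},
    ],
}

self_uptime, peer_uptimes = split_uptimes(sample_status)
print("process uptime of this node:", self_uptime)
print("heartbeat-based uptimes of peers:", peer_uptimes)
```

In this sample every peer value is below the self value, which is the pattern that makes the output look like plain process uptime when it is not.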

The manual replSetGetStatus page currently says:

replSetGetStatus.members[n].uptime

The uptime field holds a value that reflects the number of seconds that this member has been online.

This value does not appear for the member that returns the rs.status() data.

The "been online" description is vague, and it's easy to see rs.status() output that accidentally seems to confirm the common idea of uptime, i.e. that of a process. Instead the manual should convey that the value starts with the heartbeat initiation logic, and that it is relative to the member you're executing the command on.

I also see that the second line is wrong. The member that returns the rs.status() data shows an uptime too, since 3.2 for certain and maybe even earlier. Of course a node doesn't have heartbeat data for itself, but in the code (ReplicationCoordinatorImpl::processReplSetGetStatus) I see it is calculated as 'now - serverGlobalParams.started'.

I suggest the description be changed to the following.

The uptime field shows how long, in seconds, heartbeats have been established to that node. For the member the command is being run on, there is no heartbeat data, so the time since its last restart is displayed instead.

The value will reset when the process is restarted, so a node that was restarted an hour ago will report 3600 for itself and values <= 3600 for the other nodes it has connected to since the restart.
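The arithmetic behind that example can be sketched with a toy model. This is not the server's code: reported_uptimes is a hypothetical function, the timestamps are made-up seconds on a shared clock, and the peer calculation simply follows the description in this ticket (time since the first heartbeat reply).

```python
def reported_uptimes(now, process_started, first_heartbeat):
    """Toy model of the uptimes one member would report.

    now, process_started: seconds on a shared clock (hypothetical inputs).
    first_heartbeat: maps a peer name to the time the first heartbeat reply
    arrived from that peer after this process started.
    """
    # Self entry: time since the process started.
    uptimes = {"self": now - process_started}
    # Peer entries: time since heartbeats to that peer were established.
    for peer, first_reply_at in first_heartbeat.items():
        uptimes[peer] = now - first_reply_at
    return uptimes


# A node restarted 3600 seconds ago; peers answered a few seconds later.
restarted = reported_uptimes(
    now=10_000,
    process_started=6_400,
    first_heartbeat={"host-b": 6_402, "host-c": 6_405},
)
print(restarted)
```

Under these assumptions the restarted node reports 3600 for itself and slightly smaller values for each peer, because a heartbeat reply cannot arrive before the process has started.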



 Comments   
Comment by Education Bot [ 31/Oct/22 ]

Hello! This ticket has been closed due to inactivity. If you believe this ticket is still important, please reopen it and leave a comment to explain why. Thank you!
