- Type: Improvement
- Resolution: Won't Do
- Priority: Major - P3
- Affects Version/s: None
- Labels: None
I recently discovered I have had the wrong impression about the uptime values in rs.status() (i.e. replSetGetStatus) output for a very long time (~3 years).
I thought the uptime was that of the unix/windows process, but this is only true for the "self": true member, i.e. the node you execute rs.status() on.
For the other nodes it is the span of time since the first heartbeat was returned from them. So if you restart a node and then run rs.status() on it, every uptime it reports will have restarted from zero. Run from any other node, though, only the restarted node shows a small uptime; the remaining members keep their higher values.
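To make the distinction concrete, here is a minimal Python sketch that labels each member's uptime in a replSetGetStatus-style document. The `describe_uptimes` helper and the sample host names and values are hypothetical; only the `members`, `name`, `self`, and `uptime` field names come from the command output.

```python
def describe_uptimes(status):
    """Return (name, uptime, meaning) for each member of a replSetGetStatus-style dict."""
    rows = []
    for member in status["members"]:
        # Only the member the command was run on carries "self": True.
        if member.get("self", False):
            meaning = "seconds since this node's process started"
        else:
            meaning = "seconds since the reporting node first got a heartbeat from this member"
        rows.append((member["name"], member["uptime"], meaning))
    return rows


# Hypothetical output, as seen from a node that was restarted about two minutes ago.
sample_status = {
    "set": "rs0",
    "members": [
        {"name": "host-a:27017", "self": True, "uptime": 120},
        {"name": "host-b:27017", "uptime": 118},
        {"name": "host-c:27017", "uptime": 117},
    ],
}

for name, uptime, meaning in describe_uptimes(sample_status):
    print(f"{name}: {uptime}s ({meaning})")
```

Against a live replica set, the same dict could presumably be obtained with PyMongo's `client.admin.command("replSetGetStatus")`; that is not exercised here.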
The manual replSetGetStatus page currently says:
The "been online" description is vague, and it's easy to see rs.status() output that accidentally appears to confirm the common idea of uptime, i.e. that of a process. Instead the manual should convey that the value starts with the heartbeat initiation logic, and that it is relative to the member you're executing the command on.
I also see the second line is wrong. The member that returns the rs.status() data shows an uptime too, since at least 3.2 and possibly earlier. Of course a node doesn't have heartbeat data for itself, but in the code (ReplicationCoordinatorImpl::processReplSetGetStatus) I see it is calculated as 'now - serverGlobalParams.started'.
I suggest the description be changed to the following.
The uptime field shows how long heartbeats have been established to that node. For the member the command is being run on, there is no heartbeat data, so the time since the last restart is displayed instead.
The value will reset when the process is restarted, so a node that was restarted an hour ago will report 3600 (seconds) for itself and values <= 3600 for the other nodes it has connected to since the restart.
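The arithmetic behind the one-hour example can be sketched as follows. This is a simplified Python model of the behaviour described in this ticket, not the server's C++ code; the function name, host names, and timestamps are all made up for illustration.

```python
def reported_uptimes(now, process_started, first_heartbeats):
    """Model the uptimes one node would report, with all times in whole seconds.

    now              -- current time on the reporting node
    process_started  -- when the reporting node's process started
    first_heartbeats -- {member name: time of first heartbeat received from it}
    """
    # The reporting node has no heartbeat data for itself, so it uses its start time.
    uptimes = {"self": now - process_started}
    # Every other member is measured from the first heartbeat received from it.
    for name, first_seen in first_heartbeats.items():
        uptimes[name] = now - first_seen
    return uptimes


# Hypothetical timeline: the node restarted 3600 seconds before "now" and heard
# from its two peers a few seconds after coming back up.
now = 10_000
uptimes = reported_uptimes(
    now,
    process_started=now - 3600,
    first_heartbeats={"host-b:27017": now - 3598, "host-c:27017": now - 3595},
)
print(uptimes)
# {'self': 3600, 'host-b:27017': 3598, 'host-c:27017': 3595}
```

Because no heartbeat can arrive before the process starts, every remote value in this model is bounded above by the reporting node's own uptime.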