[SERVER-12410] the lockTime value is bigger than the totalTime value in globalLock metrics of serverStatus() Created: 20/Jan/14  Updated: 20/May/15  Resolved: 23/May/14

Status: Closed
Project: Core Server
Component/s: Diagnostics
Affects Version/s: 2.4.8
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Jianfeng Xu Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OS: rhel 5.5 x64
Mongodb: mongodb-linux-x86_64-2.4.8.tgz


Attachments: Text File db.serverStatus().txt     Text File locks.txt    
Issue Links:
Related
Operating System: Linux
Participants:

 Description   

In a three-nodes replicaset, one primary and two secondaries.
x:SECONDARY> rs.config()
{
"_id" : "x",
"version" : 7,
"members" : [

{ "_id" : 3, "host" : "10.136.24.36:27032", "priority" : 2 }

,

{ "_id" : 4, "host" : "10.136.24.38:27032" }

,

{ "_id" : 5, "host" : "10.136.24.24:27032" }

]
}

Using db.serverStatus().globalLock to show lock time on all secondary nodes. Node 10.136.24.24, the lockTime value is bigger than the totalTime value. Noe 10.136.24.38 is correct. So what's the problem? Any ideas. Thanks!

NODE 10.136.24.24
x:SECONDARY> db.serverStatus().globalLock
{
"totalTime" : NumberLong("1794581793000"),
"lockTime" : NumberLong("18452882937173"),
"currentQueue" :

{ "total" : 0, "readers" : 0, "writers" : 0 }

,
"activeClients" :

{ "total" : 1, "readers" : 1, "writers" : 0 }

}

NODE 10.136.24.38
x:SECONDARY> db.serverStatus().globalLock
{
"totalTime" : NumberLong("1796619409000"),
"lockTime" : NumberLong("9604218338"),
"currentQueue" :

{ "total" : 0, "readers" : 0, "writers" : 0 }

,
"activeClients" :

{ "total" : 2, "readers" : 2, "writers" : 0 }

}



 Comments   
Comment by Ramon Fernandez Marina [ 23/May/14 ]

Hi Jianfeng,

I haven't heard back from you for some time, so I'm going to mark this ticket as resolved. If this is still an issue for you, feel free to re-open and provide additional information.

Regards,
Ramón.

Comment by Ramon Fernandez Marina [ 15/May/14 ]

Hi Jianfeng,

we haven't head from you in over two weeks. Have you had a chance to replace ntpdate with a NTP daemon and observe what happens? If you have, can you please confirm whether the behavior you describe goes away?

Comment by Ramon Fernandez Marina [ 29/Apr/14 ]

Hi Jianfeng,

I can get the totalTime counter to be lower than globalLock by setting the clock backwards a few seconds using the date command, but over time the the totalTime counter grows bigger than globalLock – so I'm wondering whether this counter discrepancy you're observing persists over time, or if it goes away after a few minutes.

In other words, if ntpdate makes the clock jump backwards (you can use "ntpdate -s" to see in syslog what ntpdate is doing) then I'd say what you're seeing is expected. If this is an issue for you I'd recommend you use a NTP daemon to make the clock drift and avoid these sudden jumps.

Comment by Jianfeng Xu [ 09/Apr/14 ]

Hi Stephen,

We use ntpdate to synchronise the clocks to a ntpd server on all the rs nodes in every three hours. Like this:
0 */3 * * * /usr/sbin/ntpdate 10.136.1.100;clock --systohc

And all the rs nodes are running on the real machine. Thanks!

Comment by Thomas Rueckstiess [ 07/Apr/14 ]

Hi Jianfeng,

Is this still an issue for you? If so, can you answer Stephen's questions above? We'd like to find out what could be causing the difference you are seeing and our current hypothesis is that it could be related to clock skew or small clock adjustments.

Thanks,
Thomas

Comment by Stennie Steneker (Inactive) [ 18/Mar/14 ]

Hi Jianfeng,

I'm wondering if perhaps the odd times are due to clock skew/adjustments (SERVER-4740/SERVER-8475).

A few questions:

  • Are you using NTP to synchronise the clocks on these servers? If so, can you describe the method you use (ntpd, ntpdate, ...)?
  • Are these servers VMs? If so, what type of VM is being used (KVM, Xen, EC2, ... )?

Thanks,
Stephen

Comment by Jianfeng Xu [ 11/Feb/14 ]

Hi Dan
I restarted this node and the stats had be reset. But after a while, the problem is raised again. Can u analyze the problem by the attach db.serverStatus() output file.Thanks.

Comment by Daniel Pasette (Inactive) [ 21/Jan/14 ]

That output looks normal and the global lock time reported in the output is consistent (reported in "." of the locks section). I can't think of any known bugs related to this symptom. If everything is functioning normally aside from this stat, I don't think it's anything to worry about. The stats will only be reset with a restart of the server – please re-post if you see this behavior again.

Comment by Jianfeng Xu [ 21/Jan/14 ]

Please look at the attach file locks.txt. Thanks.

Comment by Daniel Pasette (Inactive) [ 20/Jan/14 ]

can you attach the results of db.serverStatus().locks as well?

Generated at Thu Feb 08 03:28:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.