[SERVER-2886] Many commands being executed for 1271310319ms, mongo Created: 01/Apr/11  Updated: 05/Oct/12  Resolved: 25/Jun/12

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 1.8.0
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: aleksey Assignee: Aaron Staple
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian Lenny 64-bit, 6 Gb RAM


Issue Links:
Duplicate
duplicates SERVER-4740 Use monotonic clock sources for Timer Closed
is duplicated by SERVER-5745 profiling shows queries during round ... Closed
Related
is related to SERVER-6550 db.currentOp() incorrectly reports in... Closed
Operating System: Linux
Participants:

 Description   

Server hosts two mongod instances:

1. First mongod contains single 256 Mb capped collections with 100-300 inserts/sec and one index (no replica set, no sharding).
2. Second mongod is a replica set member.

It has been working fine for about year, but since day before yesterday both instances began to lot of log messages like this:

query admin.$cmd ntoreturn:1 command:

{ getlasterror: 1 }

reslen:83 1271310319ms
insert db_name.coll_name 1271310319ms

or in case of replica set

query admin.$cmd ntoreturn:1 command:

{ replSetHeartbeat: "vsrv", v: 12, pv: 1, checkEmpty: false, from: "213.xxx.xxx.xxx:27017" }

reslen:132 1271310319ms

Servers with other replica set member doesn't produce this messages.

Maybe this issue connected with the fact that kernel began to log this messages to syslog:
Apr 1 00:51:37 vsrv-1 kernel: [41610.222721] BUG: soft lockup - CPU#0 stuck for 101s! [mongod:3853]



 Comments   
Comment by Aaron Staple [ 25/Jun/12 ]

We've seen some strange logs in the past where the clock is moving forward but reported op times are always the same large value. This could potentially be caused by certain types of flapping of the clock value but is still suspicious.

Nonetheless, use of a non monotonic clock seems like the most likely cause at this point. So I'm going to close this ticket and mark as a duplicate of SERVER-4740.

Comment by Kay Agahd [ 10/May/12 ]

You could fix it by using monotonic clock which is better suited than current time to measure time intervals. Monotonic clock is never negativ and often faster than calling gettimeofday.

Comment by Randolph Tan [ 09/May/12 ]

This probably has to do with our use of gettimeofday for measuring time. It appears that it is possible for the time returned from gettimeofday be backwards and this will give you negative time difference.

Here's a short program below that simulates CurOp::elapsedMillis and shows how -10msec can interpreted into 14days

int main(void) {
  unsigned long long ull = -10000; // -10msec
  int i = (int) (ull / 1000); // Interestingly, if you don't have parenthesis here, the output will show up as -10. (Our code has parens)
  cout << "i: " << i << endl; // This gives you an output of i: 1271310309
}

Comment by aleksey [ 08/Apr/11 ]

After we have allocated more RAM to this server, problem seems to be disappeared.

It's strange, because database is not very large and this server hosts only this db and few workers, that doesn't consume much RAM:

vsrv:SECONDARY> db.stats()
{
"db" : "getmyad_db",
"collections" : 38,
"objects" : 9636389,
"avgObjSize" : 215.48789946109483,
"dataSize" : 2076525224,
"storageSize" : 2762447616,
"numExtents" : 149,
"indexes" : 45,
"indexSize" : 1370229568,
"fileSize" : 8519680000,
"ok" : 1
}

vsrv:SECONDARY> db.serverStatus().mem
{
"bits" : 64,
"resident" : 4017,
"virtual" : 11084,
"supported" : true,
"mapped" : 10476
}

Comment by Scott Hernandez (Inactive) [ 02/Apr/11 ]

If you have OS/hardware problems that would most likely be the cause. Are there any other messages coming from the os?

Generated at Thu Feb 08 03:01:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.