[SERVER-4709] Strange timing of operations reported in logs Created: 18/Jan/12  Updated: 29/Feb/12  Resolved: 25/Jan/12

Status: Closed
Project: Core Server
Component/s: Logging
Affects Version/s: 2.0.2
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Aristarkh Zagorodnikov Assignee: Ben Becker
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File test.c    
Issue Links:
Related
is related to SERVER-4740 Use monotonic clock sources for Timer Closed
Participants:

 Description   

While checking the logs for the cause of SERVER-4100 (that case is unrelated and is mentioned only for context), I found some strange records on the primary:

Sat Jan 14 02:15:37 [initandlisten] connection accepted from 10.2.1.200:48055 #108356
Sat Jan 14 02:15:48 [conn108356] update models_stats.show_log query: { _id: ObjectId('4ebf7eb76b8dda061480d539') } update: { $unset: { h: 1 }, $set: { e: null, s: 1321172663 } } keyUpdates:1 1271310315ms
Sat Jan 14 02:15:53 [conn108356] end connection 10.2.1.200:48055

This part of the log shows the update running for roughly 14.7 days (1271310315 ms), yet the entire connection existed for less than a minute, so something looks amiss.
We have tens of such entries in the logs, all reporting the same duration (1271310315ms), on different connections and for different operations.



 Comments   
Comment by Ben Becker [ 25/Jan/12 ]

I'm going to close this issue for now. Please reopen if you encounter this issue again.

Comment by Ben Becker [ 23/Jan/12 ]

SERVER-4740 may help with strange time results.

Comment by Aristarkh Zagorodnikov [ 22/Jan/12 ]

I tend to blame virtualization: we ran the same workload on physical hardware, and when I checked those logs there were no similar problems.
I would also like to note that this effect appeared twice, both times while the system was under unusually high load (to the point of not being able to type in SSH).

Comment by Aristarkh Zagorodnikov [ 22/Jan/12 ]

Will do next week.

Comment by Ben Becker [ 22/Jan/12 ]

It's worth noting that mongo does not use a monotonic timer for most stats, so any change in the system time will affect the results. This can happen due to a Xen live migration, a system suspend, an NTP time change, a user-initiated time change, etc.
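
For illustration only (this is not code from the MongoDB source tree), a minimal C sketch of the difference: an elapsed time derived from gettimeofday() absorbs any step in the wall clock, while one derived from clock_gettime(CLOCK_MONOTONIC) does not.

#include <stdio.h>
#include <time.h>
#include <sys/time.h>

/* Wall-clock elapsed time: any settimeofday()/NTP step between the two
 * samples shows up in the result and can yield huge or negative values. */
static long long elapsed_wall_us(void)
{
    struct timeval a, b;
    gettimeofday(&a, NULL);
    /* ... operation being timed ... */
    gettimeofday(&b, NULL);
    return (b.tv_sec - a.tv_sec) * 1000000LL + (b.tv_usec - a.tv_usec);
}

/* Monotonic elapsed time: immune to wall-clock adjustments. */
static long long elapsed_mono_us(void)
{
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    /* ... operation being timed ... */
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) * 1000000LL + (b.tv_nsec - a.tv_nsec) / 1000;
}

int main(void)
{
    printf("wall: %lld us, mono: %lld us\n", elapsed_wall_us(), elapsed_mono_us());
    return 0;
}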

If this issue is ongoing (and you have the time), could you try building the attached test.c file and leaving it running so that it is active the next time the issue appears? It calls gettimeofday() and clock_gettime() every ~50ms and shouldn't use much CPU. Please post back any output from the program, or let us know if you observe another invalid time while the program reports nothing.

To build and run (assuming test.c is in the current directory):

gcc ./test.c -o test -lrt && ./test
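
The attached test.c is not reproduced in this export; as a rough, hypothetical sketch of the approach described above (sample both clocks every ~50ms and flag disagreement), such a monitor might look like the following, where the 100ms threshold is an arbitrary choice for illustration.

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/time.h>

/* Hypothetical sketch, not the actual attachment: sample the wall clock and
 * the monotonic clock every ~50ms and report whenever the wall clock advances
 * by a noticeably different amount than the monotonic clock did. */
int main(void)
{
    struct timeval  prev_wall, cur_wall;
    struct timespec prev_mono, cur_mono;

    gettimeofday(&prev_wall, NULL);
    clock_gettime(CLOCK_MONOTONIC, &prev_mono);

    for (;;) {
        usleep(50000);  /* ~50ms */
        gettimeofday(&cur_wall, NULL);
        clock_gettime(CLOCK_MONOTONIC, &cur_mono);

        long long wall_us = (cur_wall.tv_sec - prev_wall.tv_sec) * 1000000LL
                          + (cur_wall.tv_usec - prev_wall.tv_usec);
        long long mono_us = (cur_mono.tv_sec - prev_mono.tv_sec) * 1000000LL
                          + (cur_mono.tv_nsec - prev_mono.tv_nsec) / 1000;

        /* Flag anything where the two clocks disagree by more than 100ms. */
        if (wall_us - mono_us > 100000 || mono_us - wall_us > 100000)
            printf("clock jump: wall advanced %lld us, monotonic %lld us\n",
                   wall_us, mono_us);

        prev_wall = cur_wall;
        prev_mono = cur_mono;
    }
    return 0;
}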

Thanks for the report; I'll see if we can move towards safer, monotonic timing functions in the future.

Comment by Aristarkh Zagorodnikov [ 21/Jan/12 ]

This is a custom-built Xen kernel, and no source is currently available, since we don't control the environment (we lease it; think of a small Amazon-style cloud that sells services to us, and we don't even know what hardware it runs on). Consequently, we can't force them to update the kernel.
Currently the issue doesn't cause us any problems; I just posted to inform you about the log inconsistencies.

I'll check the logs as you asked, but don't hold your breath.

Comment by Ben Becker [ 21/Jan/12 ]

Hm, I haven't found anything in the mongo source code that could be causing the strange time values (yet). This sounds like a bug in the kernel or hypervisor. Is the source and config for this kernel (v2.6.32.36-228-scalaxy) available? Is this OS running under KVM, Xen or another hypervisor?

There have been several patches related to invalid time values in the linux-stable 2.6.32.y branch since 2011-04-14, and both KVM and Xen have had significant TSC-related bugs in the past (some examples: Ubuntu on Xen bug, KVM kvmclock() jump bug, KVM kvmclock() backwards bug).

Because the bogus log entries are transient, I would suspect an issue related to reading the TSC from another processor or vcpu. Can you check the kernel log (dmesg) for anything related to HPET, TSC, CLOCK, or skewing? Do you have access to the hypervisor/VMM/Dom0 host? There may also be some useful information in /proc/timer_list and /proc/cpuinfo. Also, please note that certain processors (e.g. AMD K8 Rev.A-E) have known issues with APIC timers, HPET and/or TSC values.

I've had a test running for about 8 hours now in a virtualized Ubuntu 10.04 box (v2.6.32-36 on Parallels) that constantly runs curTimeMicros64() with no anomalies so far. Is there any additional information you could provide that might help to reproduce this issue?

Comment by Aristarkh Zagorodnikov [ 21/Jan/12 ]

xm@dbs1-d:~$ uname -a
Linux dbs1-d.dc2.bru 2.6.32.36-228-scalaxy #1 SMP 2011-04-14 10:12:31 +0200 x86_64 GNU/Linux

xm@dbs1-d:~$ mongo
MongoDB shell version: 2.0.2
connecting to: test
PRIMARY> db.serverStatus()
{
    "host" : "dbs1-d.dc2.bru",
    "version" : "2.0.2",
    "process" : "mongod",
    "uptime" : 320149,
    "uptimeEstimate" : 266481,
    "localTime" : ISODate("2012-01-21T09:49:14.719Z"),
    "globalLock" : {
        "totalTime" : 320149677975,
        "lockTime" : 417786069,
        "ratio" : 0.001304971073663314,
        "currentQueue" : { "total" : 0, "readers" : 0, "writers" : 0 },
        "activeClients" : { "total" : 1, "readers" : 1, "writers" : 0 }
    },
    "mem" : { "bits" : 64, "resident" : 1992, "virtual" : 9925, "supported" : true, "mapped" : 4670, "mappedWithJournal" : 9340 },
    "connections" : { "current" : 9, "available" : 15991 },
    "extra_info" : { "note" : "fields vary by platform", "heap_usage_bytes" : 940880, "page_faults" : 1698 },
    "indexCounters" : {
        "btree" : { "accesses" : 70308, "hits" : 70308, "misses" : 0, "resets" : 0, "missRatio" : 0 }
    },
    "backgroundFlushing" : { "flushes" : 5330, "total_ms" : 468517, "average_ms" : 87.90187617260788, "last_ms" : 8, "last_finished" : ISODate("2012-01-21T09:48:55.927Z") },
    "cursors" : { "totalOpen" : 1, "clientCursors_size" : 1, "timedOut" : 0 },
    "network" : { "bytesIn" : 239991766, "bytesOut" : 4429271600, "numRequests" : 3389982 },
    "repl" : { "setName" : "stat.drive.ru", "ismaster" : true, "secondary" : false, "hosts" : [ "dbs1-d.dc2.bru:27017", "dbs2-d.dc2.bru:27017" ], "arbiters" : [ "dbsa-d.dc2.bru:27023" ], "primary" : "dbs1-d.dc2.bru:27017", "me" : "dbs1-d.dc2.bru:27017" },
    "opcounters" : { "insert" : 1001856, "query" : 92647, "update" : 83252, "delete" : 0, "getmore" : 1087289, "command" : 1124967 },
    "asserts" : { "regular" : 0, "warning" : 0, "msg" : 0, "user" : 3, "rollovers" : 0 },
    "writeBacksQueued" : false,
    "dur" : {
        "commits" : 84,
        "journaledMB" : 0.114688,
        "writeToDataFilesMB" : 0.02319,
        "compression" : 0.9999912807679899,
        "commitsInWriteLock" : 0,
        "earlyCommits" : 0,
        "timeMs" : { "dt" : 3032, "prepLogBuffer" : 0, "writeToJournal" : 8, "writeToDataFiles" : 0, "remapPrivateView" : 0 }
    },
    "ok" : 1
}
PRIMARY> rs.config()
{
    "_id" : "stat.drive.ru",
    "version" : 2,
    "members" : [
        { "_id" : 0, "host" : "dbsa-d.dc2.bru:27023", "arbiterOnly" : true },
        { "_id" : 1, "host" : "dbs1-d.dc2.bru:27017", "priority" : 10 },
        { "_id" : 2, "host" : "dbs2-d.dc2.bru:27017" }
    ]
}

Nothing specific (or alarming) was found in the logs: plain connects/disconnects and the occasional 100-200ms query.

Comment by Ben Becker [ 21/Jan/12 ]

Could you provide the output of the shell command uname -a and the output of the mongo command db.serverStatus()? Also, is this a single-server deployment or are there replica sets and/or multiple shards? Is there any additional log output near what was initially provided?

Comment by Aristarkh Zagorodnikov [ 18/Jan/12 ]

xm@dbs1-d:~$ mongod --version
db version v2.0.2, pdfile version 4.5
Thu Jan 19 01:39:50 git version: 514b122d308928517f5841888ceaa4246a7f18e3

Comment by Eliot Horowitz (Inactive) [ 18/Jan/12 ]

Are you sure this is 2.0.2?
There was a bug, SERVER-3910, in 2.0.0 that was fixed in 2.0.1.

Comment by Aristarkh Zagorodnikov [ 18/Jan/12 ]

All of them are either insert or update operations, no queries.
