[SERVER-4709] Strange timing of operations reported in logs Created: 18/Jan/12 Updated: 29/Feb/12 Resolved: 25/Jan/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Logging |
| Affects Version/s: | 2.0.2 |
| Fix Version/s: | None |
| Type: | Question | Priority: | Major - P3 |
| Reporter: | Aristarkh Zagorodnikov | Assignee: | Ben Becker |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | test.c |
| Issue Links: |
| Participants: | Aristarkh Zagorodnikov, Ben Becker, Eliot Horowitz |
| Description |
|
While checking the logs for Sat Jan 14, I noticed the following entries:

Sat Jan 14 02:15:37 [initandlisten] connection accepted from 10.2.1.200:48055 #108356
update: { $unset: { h: 1 }, $set: { e: null, s: 1321172663 } } keyUpdates:1 1271310315ms

This part of the log shows the update running for something around 14 days, but the entire connection existed for less than a minute, so it looks like something is amiss. |
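For reference, the reported duration converts as:

1271310315 ms ≈ 1271310 s ≈ 353 hours ≈ 14.7 days

which matches the "around 14 days" estimate in the description.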
| Comments |
| Comment by Ben Becker [ 25/Jan/12 ] | |
|
I'm going to close this issue for now. Please reopen if you encounter this issue again. | |
| Comment by Ben Becker [ 23/Jan/12 ] | |
|
| |
| Comment by Aristarkh Zagorodnikov [ 22/Jan/12 ] | |
|
I tend to blame virtualization: we ran the same workload on physical hardware, and when I checked those logs there were no similar problems. | |
| Comment by Aristarkh Zagorodnikov [ 22/Jan/12 ] | |
|
Will do next week. | |
| Comment by Ben Becker [ 22/Jan/12 ] | |
|
It's worth noting that mongo does not use a monotonic timer for most stats, so any change in system time will affect the results. This can happen due to a Xen live migration, system suspend, NTP time change, user-initiated time change, etc. If this issue is ongoing (and you have the time), could you try building the attached test.c file and leaving it running until the issue appears? It runs gettimeofday() and clock_gettime() every ~50ms and shouldn't use much CPU. Please post back any output from this program, or let me know if you observe another invalid time while the program reports nothing. To build and run (assuming test.c is in the current directory):
Thanks for the report; I'll see if we can move towards safer, monotonic timing functions in the future. | |
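The test.c attachment itself is not reproduced in this export, so here is a minimal sketch of the kind of checker described in the comment above: it samples gettimeofday() and clock_gettime(CLOCK_MONOTONIC) roughly every 50 ms and reports whenever the wall clock disagrees with the monotonic clock. The file name, the 1-second threshold, and the output format are assumptions and may differ from the actual attachment.

/* Sketch of a wall-clock sanity checker in the spirit of the attached
 * test.c (not reproduced here); the threshold and output format are
 * assumptions, not the original program. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <unistd.h>

static long long tv_usec64(const struct timeval *tv) {
    return (long long)tv->tv_sec * 1000000LL + tv->tv_usec;
}

static long long ts_usec64(const struct timespec *ts) {
    return (long long)ts->tv_sec * 1000000LL + ts->tv_nsec / 1000;
}

int main(void) {
    struct timeval wall_prev, wall_now;
    struct timespec mono_prev, mono_now;

    gettimeofday(&wall_prev, NULL);
    clock_gettime(CLOCK_MONOTONIC, &mono_prev);

    for (;;) {
        usleep(50000);  /* sample roughly every 50 ms */

        gettimeofday(&wall_now, NULL);
        clock_gettime(CLOCK_MONOTONIC, &mono_now);

        long long wall_delta = tv_usec64(&wall_now) - tv_usec64(&wall_prev);
        long long mono_delta = ts_usec64(&mono_now) - ts_usec64(&mono_prev);

        /* The monotonic clock cannot go backwards; if the wall clock
         * moves backwards or disagrees with it by more than ~1 s,
         * something (NTP step, migration, hypervisor TSC trouble)
         * changed the system time. */
        if (wall_delta < 0 || llabs(wall_delta - mono_delta) > 1000000LL) {
            printf("time anomaly: wall delta %lld us vs monotonic delta %lld us\n",
                   wall_delta, mono_delta);
        }

        wall_prev = wall_now;
        mono_prev = mono_now;
    }
    return 0;
}

Building it with something like gcc -o timecheck timecheck.c -lrt (the -lrt is needed for clock_gettime() on older glibc) and leaving it running during the anomaly window would be one way to use such a checker.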
| Comment by Aristarkh Zagorodnikov [ 21/Jan/12 ] | |
|
This is a custom-built Xen kernel, and no source is currently available since we don't control the environment (we lease it; think of a small Amazon-like cloud that sells services to us, and we don't even know what kind of hardware it runs on). Consequently, we can't force them to update the kernel. I will check the logs as you asked, but don't hold your breath. | |
| Comment by Ben Becker [ 21/Jan/12 ] | |
|
Hm, I haven't found anything in the mongo source code that could be causing the strange time values (yet). This sounds like a bug in the kernel or hypervisor. Are the source and config for this kernel (v2.6.32.36-228-scalaxy) available? Is this OS running under KVM, Xen, or another hypervisor?

There have been several patches related to invalid time values in the linux-stable 2.6.32.y branch since 2011-04-14, and both KVM and Xen have had significant TSC-related bugs in the past (some examples: Ubuntu on Xen bug, KVM kvmclock() jump bug, KVM kvmclock() backwards bug). Because the bogus log entries are transient, I would suspect an issue related to reading the TSC from another processor or vcpu.

Can you check the kernel log (dmesg) for anything related to HPET, TSC, CLOCK, or skewing? Do you have access to the hypervisor/VMM/Dom0 host? There may also be some useful information in /proc/timer_list and /proc/cpuinfo. Also, please note that certain processors (e.g. AMD K8 Rev.A-E) have known issues with APIC timers, HPET and/or TSC values.

I've had a test running for about 8 hours now in a virtualized Ubuntu 10.04 box (v2.6.32-36 on Parallels) that constantly runs curTimeMicros64(), with no anomalies so far. Is there any additional information you could provide that might help to reproduce this issue? | |
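As a companion to the dmesg and /proc/timer_list suggestions in the comment above, the clocksource the kernel actually selected (tsc, hpet, acpi_pm, xen, ...) is often the quickest hint. Below is a minimal sketch that prints it, assuming the standard Linux sysfs paths; this helper is not part of the issue's attachments.

/* Sketch: print the kernel's current and available clocksources.
 * The sysfs paths are the standard Linux locations; whether they exist
 * depends on the kernel build, so a missing file is reported, not fatal. */
#include <stdio.h>

static void dump_file(const char *label, const char *path) {
    char buf[256];
    FILE *f = fopen(path, "r");
    if (!f) {
        printf("%s: <not available: %s>\n", label, path);
        return;
    }
    if (fgets(buf, sizeof buf, f) != NULL)
        printf("%s: %s", label, buf);   /* buf keeps its trailing newline */
    fclose(f);
}

int main(void) {
    dump_file("current clocksource   ",
              "/sys/devices/system/clocksource/clocksource0/current_clocksource");
    dump_file("available clocksources",
              "/sys/devices/system/clocksource/clocksource0/available_clocksource");
    return 0;
}

On a Xen guest one would typically expect to see xen reported here; seeing tsc inside a guest makes TSC-related skew a more plausible suspect.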
| Comment by Aristarkh Zagorodnikov [ 21/Jan/12 ] | |
|
xm@dbs1-d:~$ uname -a
xm@dbs1-d:~$ mongo
... { "_id" : 2, "host" : "dbs2-d.dc2.bru:27017" } ]

Nothing specific (alarming) found in the logs: plain connects/disconnects and occasional 100-200ms queries. | |
| Comment by Ben Becker [ 21/Jan/12 ] | |
|
Could you provide the output of the shell command uname -a and the output of the mongo command db.serverStatus()? Also, is this a single-server deployment or are there replica sets and/or multiple shards? Is there any additional log output near what was initially provided? | |
| Comment by Aristarkh Zagorodnikov [ 18/Jan/12 ] | |
|
xm@dbs1-d:~$ mongod --version | |
| Comment by Eliot Horowitz (Inactive) [ 18/Jan/12 ] | |
|
Are you sure this is 2.0.2? | |
| Comment by Aristarkh Zagorodnikov [ 18/Jan/12 ] | |
|
All of them are either insert or update operations, no queries. |