[SERVER-2114] Don't use select timeouts for fast coarse timing Created: 18/Nov/10  Updated: 02/Aug/18  Resolved: 22/Jun/16

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Roger Binns Assignee: Andrew Morrow (Inactive)
Resolution: Done Votes: 36
Labels: polish, pull-request
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

64 bit Ubuntu 10.10


Attachments: File mongo-timer-fix-v1.patch     File mongo-timer-fix-v3.patch     File mongo-timer-fix-v4.patch    
Issue Links:
Depends
is depended on by SERVER-9818 Problems when starting MongoDB with >... Closed
Duplicate
is duplicated by SERVER-8381 Listener timer inaccuracies on non-Linux Closed
is duplicated by SERVER-23614 High CPU Load on Idle/Prod Server Closed
is duplicated by SERVER-8049 Very short timeout for select() call ... Closed
Related
related to SERVER-23243 Extract time-keeping from Listener Closed
related to SERVER-21538 Choose clock source for reading curre... Closed
related to SERVER-1279 Get more accurate elapsed time in the... Closed
is related to SERVER-15389 Cannot start mongod when opening too ... Closed
is related to SERVER-8939 consider making server not use sleepm... Closed
Backwards Compatibility: Fully Compatible
Operating System: Linux
Sprint: Platforms 16 (06/24/16)
Participants:
Linked BF Score: 0

 Description   

The mongod process causes wakeups 100 times per second, as can be seen by running strace on the process. This is done by setting a one-hundredth-of-a-second timeout on a select call on the listening socket. These wakeups increase power consumption and utilization on virtual machines. It should be noted that this happens all the time, even if there are no connections or any other activity happening. Powertop will consistently show mongod as a top offender. An explanation on the mailing list was given:

> One main reason is we need a somewhat coarse timer that is as fast as possible.
> So we use this one loop for incrementing that counter.
> This is much faster than using any system call for getting wall time.

Linux has had a system call for getting a timer value, clock_gettime, for a number of years. Except on ancient 32-bit hardware, this call is implemented using the vDSO mechanism: a page the kernel maps into process memory. clock_gettime is implemented without there being a system call at all! Heck, they even added "a somewhat coarse timer that is as fast as possible"! http://lwn.net/Articles/342018/

You should be able to replace the timeouts and the variable they were updating with calls to clock_gettime.
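
A minimal sketch of the suggested replacement, written in Python for brevity (the server itself is C++). Elapsed time is read on demand from the monotonic clock instead of being accumulated by a loop of timeouts. The function name `elapsed_millis` is hypothetical; `time.clock_gettime_ns` is Python's wrapper around clock_gettime and is only available on Unix.

```python
import time


def elapsed_millis(start_ns: int) -> int:
    """Milliseconds elapsed since start_ns, read on demand from the monotonic clock."""
    now_ns = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
    return (now_ns - start_ns) // 1_000_000


# No periodic wakeup is needed: the clock is consulted only when someone asks.
start_ns = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
time.sleep(0.05)
print(elapsed_millis(start_ns), "ms elapsed")
```

Between the two clock reads the process does nothing at all, which is the property the ticket asks for.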



 Comments   
Comment by Andrew Morrow (Inactive) [ 22/Jun/16 ]

Hi watchers of this ticket -

I'm closing this ticket as 'done'. As of https://github.com/mongodb/mongo/commit/710159c9602a6738e6455cfb26bc2d70a0454ae2, the select timeout in the Listener class is no longer used for fast coarse timing.

The Listener does still have a select timeout, currently at 250ms. However, that select timeout is there not for timekeeping, but to work around the fact that closing a listening socket is not guaranteed to unblock select, which currently frustrates the shutdown logic for the Listener. We expect this to be a temporary situation, ultimately resulting in the listen path not having any timeout. This also represents an order of magnitude reduction in wakeups.

Note that we have not yet implemented SERVER-21538. Until we do so, the server will continue to wake up on a 10ms interval while sampling the clock from a background thread. Once we do implement SERVER-21538, the server will only sample on a 10ms schedule if the system clock is found to be expensive to read. Otherwise, if the system clock is cheap to read, the server will not consume power when idle just to implement time keeping. There may of course be other periodic wake-ups in the server, but probably none as aggressive as had been present for timekeeping.
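
The choice described above (sample on a 10ms schedule only when the clock is expensive to read) could be sketched roughly as follows. This is a Python illustration, not the server's code: the class names, `make_clock`, and the 1000 ns threshold are all assumptions, and in Python the measured cost is dominated by interpreter overhead rather than by the clock itself.

```python
import threading
import time


def mean_clock_read_cost_ns(samples: int = 10_000) -> float:
    """Average cost of one monotonic clock read, in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(samples):
        time.monotonic_ns()
    return (time.perf_counter_ns() - start) / samples


class DirectClock:
    """Cheap clock: read it on every request, no background work."""

    def now_ns(self) -> int:
        return time.monotonic_ns()

    def stop(self) -> None:
        pass


class SampledClock:
    """Expensive clock: a background thread refreshes a cached value every 10 ms."""

    def __init__(self, interval_s: float = 0.010) -> None:
        self._interval_s = interval_s
        self._cached_ns = time.monotonic_ns()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        # Event.wait() returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._interval_s):
            self._cached_ns = time.monotonic_ns()

    def now_ns(self) -> int:
        return self._cached_ns

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()


def make_clock(cost_ns: float, threshold_ns: float = 1_000.0):
    """Pick the implementation once, at start-up (threshold is an assumption)."""
    return DirectClock() if cost_ns <= threshold_ns else SampledClock()


clock = make_clock(mean_clock_read_cost_ns())
print(type(clock).__name__, clock.now_ns())
clock.stop()
```

Only the `SampledClock` path wakes the process periodically; with `DirectClock` an idle process stays idle.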

If the primary reason for following this ticket was interest in reduced power/battery consumption in mongod or mongos, please add yourself as a watcher on SERVER-21538.

Also note that the patches posted here are unlikely to apply correctly after the recent changes to the listener.

Comment by Andy Schwerin [ 15/Jun/15 ]

I think the best way to move forward is to do a little more work abstracting the use of timers so that we can better select a time source based on the facilities available on the (virtual) hardware. renctan recently introduced a TickSource interface to be used for doing elapsed time measurement (rather than reading the calendar time). If we got consistent about using that for elapsed time measurement, we could choose an implementation at start-up, and hang it off the ServiceContext. The question I've still got to answer is whether we need a slow-but-precise timer on systems where the fastest clock device isn't very precise. I suspect the answer is usually "no": we rarely, if ever, need a precise measurement of time if the cost of that measurement is more than a few microseconds.
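
To illustrate the idea of measuring elapsed time through an injectable tick source, here is a small Python sketch. It is an analogy only: the method names (`get_ticks`, `ticks_per_second`) and the classes below are hypothetical and are not the server's actual TickSource API.

```python
import time
from typing import Protocol


class TickSource(Protocol):
    """Anything that can report a monotonically increasing tick count."""

    def get_ticks(self) -> int: ...

    def ticks_per_second(self) -> int: ...


class SystemTickSource:
    """Ticks are nanoseconds from the monotonic clock."""

    def get_ticks(self) -> int:
        return time.monotonic_ns()

    def ticks_per_second(self) -> int:
        return 1_000_000_000


class MockTickSource:
    """Manually advanced source, useful in tests."""

    def __init__(self, ticks_per_second: int = 1000) -> None:
        self._ticks = 0
        self._tps = ticks_per_second

    def advance(self, ticks: int) -> None:
        self._ticks += ticks

    def get_ticks(self) -> int:
        return self._ticks

    def ticks_per_second(self) -> int:
        return self._tps


class ElapsedTimer:
    """Measures elapsed time without knowing which clock is underneath."""

    def __init__(self, source: TickSource) -> None:
        self._source = source
        self._start = source.get_ticks()

    def millis(self) -> int:
        delta = self._source.get_ticks() - self._start
        return delta * 1000 // self._source.ticks_per_second()


mock = MockTickSource()
timer = ElapsedTimer(mock)
mock.advance(250)
print(timer.millis())
```

Because callers only see the interface, the concrete source can be chosen once at start-up and swapped per platform.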

I'm hoping to spend a little time thinking about this later in the summer, but my plate is pretty full right now. I know that acm wants to kill the elapsed time tracker, too, because it's an extra job for the networking code that he'd rather it not have.

Comment by Ben McCann [ 11/Jun/15 ]

The issue that's been holding up the PR is that Mongo is still supporting RHEL 5 and its 2.6.18 kernel, but CLOCK_MONOTONIC_COARSE wasn't introduced until 2.6.32. Mongo is planning to continue support for RHEL 5 in at least the next major release series - 3.2, so that's not likely to change soon.

Any thoughts on moving forward? Should we work the current way for RHEL 5 and the new way for everyone else or would that be too messy?
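
One possible shape for "the new way where available, the current way otherwise" is a start-up probe with a fallback. The sketch below is Python and makes two labeled assumptions: the names `pick_clock_id` and `now_millis` are hypothetical, and the numeric id 6 for CLOCK_MONOTONIC_COARSE is taken from Linux's headers and used only if Python's time module does not export the constant itself.

```python
import sys
import time

# Assumption: 6 is the id of CLOCK_MONOTONIC_COARSE in Linux's <linux/time.h>.
# It is used only if the time module does not provide the constant.
LINUX_CLOCK_MONOTONIC_COARSE = getattr(time, "CLOCK_MONOTONIC_COARSE", 6)


def pick_clock_id() -> int:
    """Prefer the coarse monotonic clock where the kernel has it, else fall back."""
    if sys.platform.startswith("linux"):
        try:
            time.clock_gettime(LINUX_CLOCK_MONOTONIC_COARSE)
            return LINUX_CLOCK_MONOTONIC_COARSE
        except OSError:
            pass  # older kernel (e.g. 2.6.18 on RHEL 5) rejects the clock id
    return time.CLOCK_MONOTONIC


CLOCK_ID = pick_clock_id()  # decided once at start-up


def now_millis() -> int:
    """Current reading of the chosen clock, in milliseconds."""
    return time.clock_gettime_ns(CLOCK_ID) // 1_000_000


print("clock id", CLOCK_ID, "reads", now_millis(), "ms")
```

The fallback here is the regular monotonic clock; keeping the old select-based counter as the fallback instead would follow the same probe-then-choose structure.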

Comment by Mark Callaghan [ 07/Feb/15 ]

A random anecdote, but in the past mongod has been the largest CPU consumer on my Ubuntu VM when some random binary depended on it but it wasn't doing anything.

Comment by Greg White [ 06/Feb/15 ]

That's a shame, because it makes using mongo as part of a dev stack on a notebook particularly painful. On my laptop, it's the single largest power draw, even when idle.

Comment by Andrew Morrow (Inactive) [ 06/Feb/15 ]

Hi All -

I know there is a lot of interest in this ticket. We are holding off on changes in this area because we are currently evaluating the overall server network stack with an eye towards a significant refactoring or rewrite, much as has happened with other critical server subsystems during the 2.4, 2.6, and 3.0 release cycles. It is our expectation that the issue identified in this ticket would likely be tackled as part of such a refactoring, currently planned for the upcoming development cycle.

Thanks,
Andrew

Comment by Greg White [ 28/Jan/15 ]

3.0.0-rc6 has a serious regression in power usage. It's now using ~4x the power that 2.8-rc5 used - wakeups went from ~60 a second to > 210 wakeups/sec.

Edit: That's using WiredTiger. Using the mmap storage engine, things are between 2-3x worse at ~140 wakeups/sec.

Comment by Vladimir Zoubritsky [ 25/Jun/14 ]

Applying the mongo-timer-fix patch with the patches from https://jira.mongodb.org/browse/SERVER-9580 has helped us reduce CPU usage on development machines. We have packaged the patches for Ubuntu at https://build.opensuse.org/package/show/home:dottedmag:mongodb/mongodb.

Comment by Ben McCann [ 27/Feb/14 ]

I've run into this being a problem as well. It would be really great to get this patch reviewed. It's a pretty small, limited change.

Comment by Timo Lindemann [ 26/Feb/14 ]

I have been using Calvin's Patch since 2.4.6. My scenario is that I am a developer running mongodb on my work laptop.

(I asked Calvin for an update to the patch since it failed to apply in git master.)

The patch makes wakeups go from 200 per second to 20 per second for MongoDB, measured with powertop, and that is idle mongodb, just started up for like ten minutes, no ops at all, with version 2.4.6.

I realize my use case of mongodb eating my battery is perhaps only of minor importance to the grand scheme of things, but I humbly ask for another consideration to please include this patch. Maybe it'd be possible to make a compiler option out of it?

What exactly is holding the bug up and unresolved for that long of a time?

Thanks for consideration.

Comment by Axel Kittenberger [ 25/Feb/14 ]

Yay! hope it gets in there. fingers crossed

Just measured the estimated battery runtime on my notebook. It's 5% longer without mongodb running - with no connected clients.

Otherwise, is there a foolproof way to replace the mongod that came with Debian (jessie here) with one compiled with your patch?

Comment by Calvin Owens [ 25/Feb/14 ]

Updating the patch to apply to the current git HEAD - I've heard from a couple developers who say they use it.

Comment by Roger Binns [ 01/May/13 ]

Note that your estimate is only taking into account when MongoDB is running on bare metal. When running virtualized/hypervisor then the host also has to dedicate cycles to this timer and wakeups and those cycles can't be used for useful work either.

Comment by Calvin Owens [ 28/Apr/13 ]

Update: https://http.snarkywidgets.com/mongo-timer-fix-v3.patch

Comment by Calvin Owens [ 21/Apr/13 ]

Also, some crude estimation using my desktop predicts that this fix would save you about 36.5 kWh/year per server, or about $5 at the average US price. In a large datacenter, that could be quite significant. YMMV.
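
For readers who want to sanity-check the estimate, the two figures above imply a continuous power draw and an electricity price. The short Python calculation below uses only the 36.5 kWh/year and $5 numbers from the comment; nothing else is measured.

```python
HOURS_PER_YEAR = 24 * 365  # 8760

energy_kwh_per_year = 36.5  # figure from the estimate above
cost_usd_per_year = 5.0     # figure from the estimate above

# Average extra power draw that would add up to 36.5 kWh over a year.
avg_power_watts = energy_kwh_per_year * 1000 / HOURS_PER_YEAR

# Electricity price implied by the two figures.
implied_price_usd_per_kwh = cost_usd_per_year / energy_kwh_per_year

print(f"{avg_power_watts:.2f} W continuous")
print(f"${implied_price_usd_per_kwh:.3f} per kWh implied")
```

In other words, the estimate corresponds to roughly 4 W of extra draw per server around the clock.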

Comment by Calvin Owens [ 21/Apr/13 ]

I've cooked up a fix for this, tested on Linux. Unfortunately, I don't have access to an OSX, Windows, or Solaris machine I can test this on. Can anybody here test on any of those platforms?

Patch (against current master): https://http.snarkywidgets.com/mongo-timer-fix-v1.patch

Remaining questions:

  • The performance hit on Linux is negligible - is that the case on other platforms? Linux is the biggest use case, so if not, do we care?
  • Is it necessary to handle overflow? A signed 64-bit integer will overflow at 300 million years counting milliseconds, but my understanding is that (at least on Linux) there is no guarantee that the counter will start at zero. Thoughts?
  • The counter is signed on every platform but OSX, where it is unsigned. But since on OSX it is directly tied to CPU cycles (and therefore starts from zero), I don't see how it could ever be large enough to matter - even if it counts clock cycles, on a 10GHz machine it would take 30 years to get there. Am I missing something?
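
The overflow figures in the questions above can be checked with a few lines of arithmetic. This Python sketch assumes a counter that starts at zero (which, as noted, Linux does not guarantee) and a 365.25-day year.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

# Signed 64-bit counter of milliseconds, starting from zero.
years_ms_signed = (2**63 - 1) / 1000 / SECONDS_PER_YEAR

# Counter of CPU cycles on a hypothetical 10 GHz machine.
cycles_per_second = 10e9
years_cycles_signed = 2**63 / cycles_per_second / SECONDS_PER_YEAR
years_cycles_unsigned = 2**64 / cycles_per_second / SECONDS_PER_YEAR

print(f"signed ms counter:       {years_ms_signed / 1e6:.0f} million years")
print(f"10 GHz cycles, signed:   {years_cycles_signed:.1f} years")
print(f"10 GHz cycles, unsigned: {years_cycles_unsigned:.1f} years")
```

The millisecond counter lasts roughly 292 million years, and a 10 GHz cycle counter needs about 29 years to exceed the signed range, or about 58 years to wrap the full unsigned range.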

I'm going to wait to submit a pull request until testing has been done on at least OSX and Windows.

(NOTE: I did try to test this on my iMac, but the build has been broken on 32-bit OSX for a long time, apparently)

Comment by Roger Binns [ 30/Dec/12 ]

There is no debate over whether this can be fixed technically and how to fix it. Commenters and voters believe it should be fixed. 10gen has chosen not to work on it yet.

If you fix it then that would be great!

Comment by Calvin Owens [ 30/Dec/12 ]

That's my point - why not use clock_gettime() or one of its friends? It was alluded to in the original post that an efficient alternative to repeatedly calling select() didn't exist.

getElapsedTimeMillis() could just call clock_gettime(CLOCK_MONOTONIC_RAW), which would add a tiny amount of latency to each call, but would allow select() to block forever waiting for connections - a big win in terms of power consumption and CPU usage. As a bonus, you'd get a more accurate timer, since it would have nanosecond precision and wouldn't drift on the long side, as one based on select() will do over time.
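
The drift mentioned above is easy to observe. The Python sketch below is a rough illustration, not the server's loop: it uses select() purely as a 10 ms timer, counts ticks the way a timeout-driven counter would, and compares the result with the monotonic clock. The function name is hypothetical, and calling select() with three empty lists works on Unix only.

```python
import select
import time


def run_select_timer(ticks: int, tick_s: float = 0.010):
    """Count time in 10 ms select() timeouts and compare with the real clock."""
    counted_ms = 0
    start_ns = time.monotonic_ns()
    for _ in range(ticks):
        select.select([], [], [], tick_s)  # nothing to wait on: acts as a timer
        counted_ms += int(tick_s * 1000)
    actual_ms = (time.monotonic_ns() - start_ns) / 1e6
    return counted_ms, actual_ms


counted_ms, actual_ms = run_select_timer(10)
print(f"counter says {counted_ms} ms, clock says {actual_ms:.1f} ms")
```

Each timeout returns slightly late, never early, so the counter falls further behind the real clock the longer the loop runs.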

Here's a benchmark of alternative timer methods: http://stackoverflow.com/questions/6498972/faster-equivalent-of-gettimeofday/13096917#13096917
Obviously that's oversimplified, but it gives you an idea that these calls really aren't expensive.

Comment by Roger Binns [ 30/Dec/12 ]

Each timer call is not inefficient by itself. The problem is the quantity of them - 200 per second even if not a single client is connected. This causes greatly increased power consumption - https://lesswatts.org/documentation/silicon-power-mgmnt/ and for virtualized environments means that the host also has to do work. In addition (on Linux) there is no need to run such a frequent timer since clock_gettime() can be used to get a timer value on demand.

Run powertop to see just how bad mongodb is. An idle instance should have zero wakeups per second while a connected instance should have wakeups proportional to the amount of connection traffic.

Comment by Calvin Owens [ 30/Dec/12 ]

I would like to fix this. Could somebody clarify for me:
1) Why are the timer system calls believed to be inefficient?
2) The loop checks the return of inShutdown() to end its thread. Why can't this be done via a signal? Even if the fundamental exit path isn't easy to change, the dbExit function can simply raise SIGUSR1 or suchlike, for which we can set a handler to do nothing, to make select() return and check inShutdown() again.
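
A sketch of the shutdown wake-up idea, in Python. Note that it swaps in a different technique from the one proposed above: instead of raising a signal, it uses a pipe (the "self-pipe" pattern), because Python automatically retries select() when a signal handler returns normally. The function `serve_until_woken` and the loopback listener are hypothetical stand-ins for the listener thread.

```python
import os
import select
import socket
import threading


def serve_until_woken(listener: socket.socket, wake_fd: int, result: list) -> None:
    """Accept connections until a byte arrives on wake_fd; record the count."""
    accepted = 0
    while True:
        # No timeout: the thread sleeps until a client connects or shutdown is requested.
        readable, _, _ = select.select([listener, wake_fd], [], [])
        if listener in readable:
            conn, _ = listener.accept()
            conn.close()
            accepted += 1
        if wake_fd in readable:
            break
    result.append(accepted)


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # any free loopback port
listener.listen()
wake_r, wake_w = os.pipe()
result = []
worker = threading.Thread(target=serve_until_woken, args=(listener, wake_r, result))
worker.start()

# One client connects; recv() returns b"" once the server has accepted and closed.
with socket.create_connection(listener.getsockname()) as client:
    client.recv(1)

os.write(wake_w, b"x")  # the shutdown path: one byte wakes the blocked select()
worker.join(timeout=5)
listener.close()
os.close(wake_r)
os.close(wake_w)
print("accepted", result[0], "connection(s) before shutdown")
```

With this structure the idle listener causes no wakeups at all, and shutdown does not depend on a polling interval.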

Thoughts?

Comment by HRJ [ 19/Dec/12 ]

My virtual machine is waking up 400 times/second. Inside the virtual machine, mongodb is waking up 200 times/second.

Please solve this for the love of our environment.

Comment by Vitaly A. Sorokin [ 08/Oct/12 ]

+1 for this bug. affects both my macbook and linode VM. it has been there for two years! just wondering...

Comment by Axel Kittenberger [ 08/Jul/12 ]

If this could be fixed! clock() returning the kernel jiffies sure isn't slower than doing a timeout every 10ms. Or just increase the timeouts and advance the timer counter by larger values (even calculated ones).

On my MacBook Air, mongodb is always a top CPU user and pushes it into running the fan.

That mongodb is an all-powerful database is cool, but it would be even cooler if it were scalable as well, meaning able to scale down and not be an unnecessary power drain on mobile devices when nothing is going on.

Comment by Teemu Ikonen [ 18/Jun/12 ]

This issue makes using MongoDB for development on a laptop a bit harder. Mongod constantly takes a few % of CPU and causes increased battery drain.
