[SERVER-3960] High CPU usage no IO Wait Created: 27/Sep/11  Updated: 15/Feb/13  Resolved: 17/Jan/13

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 2.0.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Julien Assignee: Unassigned
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian wheezy (I'm on testing because I preferred a packaged MongoDB, from before 2.0). It's a secondary behind a MongoDB 1.8 primary in a replica set with an arbiter.


Issue Links:
Related
is related to SERVER-2212 Server automatically abort queries/co... Closed
Operating System: Linux
Participants:

 Description   

I'm running a fresh install of mongo 2.0.0 (git version 695c67df).
Most of the time it works well: 1-30% user CPU, 0% iowait, load average 0.01.
It does log some slow queries, though:

$ tail /var/log/mongodb/mongodb.log 
Tue Sep 27 11:21:33 [initandlisten] connection accepted from 10.0.2.1:58299 #89
Tue Sep 27 11:21:39 [conn24] query my.mesh ntoreturn:25 nscanned:12897 scanAndOrder:1 nreturned:1 reslen:199 101ms
Tue Sep 27 11:22:19 [conn7] query my.mesh ntoreturn:30 nscanned:35679 scanAndOrder:1 nreturned:30 reslen:3992 454ms
Tue Sep 27 11:23:05 [conn37] query my.mesh ntoreturn:30 nscanned:15470 scanAndOrder:1 nreturned:30 reslen:3920 169ms
Tue Sep 27 11:23:23 [conn20] query my.mesh ntoreturn:30 nscanned:12876 scanAndOrder:1 nreturned:30 reslen:3478 287ms
Tue Sep 27 11:24:13 [initandlisten] connection accepted from 10.0.2.1:36205 #90
Tue Sep 27 11:24:32 [conn26] query my.mesh ntoreturn:30 nscanned:30234 scanAndOrder:1 nreturned:30 reslen:3410 452ms
Tue Sep 27 11:24:39 [conn90] end connection 10.0.2.1:36205
Tue Sep 27 11:25:07 [clientcursormon] mem (MB) res:895 virt:37958 mapped:18792
Tue Sep 27 11:25:51 [conn6] query my.mesh ntoreturn:30 nscanned:9120 scanAndOrder:1 nreturned:30 reslen:3495 165ms
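The telltale sign in these lines is `scanAndOrder:1` combined with `nscanned` far above `nreturned`: the query scans thousands of documents and sorts them in memory because no index covers the sort. A small sketch (Python, with a regex written against the 2.0-era log format shown above, so the field order is an assumption) that flags such lines:

```python
import re

# Matches the 2.0-era mongod slow-query log lines shown above.
PAT = re.compile(
    r"query (?P<ns>\S+) .*?"
    r"nscanned:(?P<nscanned>\d+) "
    r"(?:scanAndOrder:(?P<sao>\d) )?"
    r"nreturned:(?P<nreturned>\d+) .*? (?P<ms>\d+)ms"
)

def flag_slow_sorts(lines, ratio=100):
    """Yield (namespace, nscanned, nreturned, ms) for queries that
    sorted in memory (scanAndOrder:1) and scanned far more documents
    than they returned."""
    for line in lines:
        m = PAT.search(line)
        if not m:
            continue
        scanned, returned = int(m["nscanned"]), int(m["nreturned"])
        if m["sao"] == "1" and scanned >= ratio * max(returned, 1):
            yield m["ns"], scanned, returned, int(m["ms"])

log = [
    "Tue Sep 27 11:21:39 [conn24] query my.mesh ntoreturn:25 nscanned:12897 scanAndOrder:1 nreturned:1 reslen:199 101ms",
    "Tue Sep 27 11:56:20 [conn394] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 110ms",
]
print(list(flag_slow_sorts(log)))
```

Run against the tail above, only the `my.mesh` queries are flagged; the `my.tinyurl` queries are slow for a different reason (they scan exactly one document).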

Some stats taken while everything is running fine:

$ du -sh /srv/mongodb
25G
> db.mesh.count()
8641843
> db.mesh.totalIndexSize()
2512363440
> db.tinyurl.count()
557262
> db.tinyurl.totalIndexSize()
64526016
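For context, converting the byte counts above: the indexes together are about 2.4 GiB, so they fit in the ~4 GB of RAM, but the 25 GB of data files do not, so data pages must be served from the OS page cache. A quick conversion (numbers copied from the output above):

```python
# Byte counts reported by totalIndexSize() above.
mesh_idx = 2512363440
tinyurl_idx = 64526016
ram_kb = 4056032           # "total" from free, reported in KiB

GiB = 1024 ** 3
total_idx_gib = (mesh_idx + tinyurl_idx) / GiB
ram_gib = ram_kb * 1024 / GiB

print(f"indexes: {total_idx_gib:.2f} GiB, RAM: {ram_gib:.2f} GiB")
```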
 
$ free
             total       used       free     shared    buffers     cached
Mem:       4056032    3883104     172928          0      34480    3712764
-/+ buffers/cache:     135860    3920172
Swap:            0          0          0
 
$ mongostat
connected to: 127.0.0.1
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn   set repl       time 
    *0     13     *0     *0       0     4|0       0  18.4g  37.3g   825m      0        0          0       0|0     0|0     1k     3k    53 shape  SEC   11:36:28 
    *0     29     *0     *0       0     5|0       0  18.4g  37.3g   825m      0        0          0       0|0     0|0     2k     5k    53 shape  SEC   11:36:29 
    *0      9     *0     *0       0     5|0       0  18.4g  37.3g   825m      0        0          0       0|0     0|0   994b     3k    53 shape  SEC   11:36:30 
    *0     16     *0     *0       0     9|0       0  18.4g  37.3g   825m      0        0          0       0|0     0|0     1k     5k    53 shape  SEC   11:36:31 
    *0     47     *0     *0       0     8|0       0  18.4g  37.3g   825m      0        0          0       0|0     0|0     4k    16k    53 shape  SEC   11:36:32 
    *0      9     *0     *0       0     4|0       0  18.4g  37.3g   825m      0        0          0       0|0     0|0   936b     3k    53 shape  SEC   11:36:33 
    *0     20     *0     *0       0     8|0       0  18.4g  37.3g   825m      0        0          0       0|0     0|0     1k     5k    53 shape  SEC   11:36:34 
    *2      7     *0     *0       0     5|0       0  18.4g  37.3g   825m      0        0          0       0|0     0|0   859b     3k    53 shape  SEC   11:36:35 
    *2      4     *0     *0       0     6|0       0  18.4g  37.3g   825m      0        0          0       0|0     0|0   702b     3k    53 shape  SEC   11:36:36 
    *3      9     *0     *0       0     7|0       0  18.4g  37.3g   749m      1      0.5          0       0|0     0|0     1k     4k    53 shape  SEC   11:36:37 
 
$ iostat -x 2
Linux 3.0.0-1-amd64 (mongo)    09/27/2011      _x86_64_        (4 CPU)
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          68.53    0.00    0.21    1.11    0.00   30.15
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.21    82.28   10.69    3.33   388.30   366.34   107.62     2.05  146.01    9.84  582.77   2.63   3.69
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.77    0.00    0.13    0.13    0.00   98.97
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    3.50    0.00    64.00     0.00    36.57     0.03    8.00    8.00    0.00   2.29   0.80
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.25    0.00    0.12    0.00    0.00   99.63
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.00    0.00    0.00  100.00
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
 
 

It can run fine for a number of minutes (~10-20, it varies), then suddenly the user CPU jumps to 100% and the load average climbs slowly toward 600. Here are some stats taken while it's at 100% CPU with a load average above 50:

mongostat
connected to: 127.0.0.1
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn   set repl       time 
    *0      6     *0     *0       0     3|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    50|0   825b     2k   101 shape  SEC   11:54:50 
    *0     16     *0     *0       0     7|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    52|0     1k     4k   101 shape  SEC   11:54:51 
    *0      5     *0     *0       0     6|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    51|0   952b     3k   101 shape  SEC   11:54:52 
    *0      9     *0     *0       0     8|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    50|0     1k     4k   101 shape  SEC   11:54:53 
    *0     15     *0     *0       0     4|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    50|0     1k     5k   101 shape  SEC   11:54:54 
    *0      9     *0     *0       0     9|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    52|0     1k     4k   101 shape  SEC   11:54:55 
    *0     12     *0     *0       0     5|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    51|0     1k     6k   100 shape  SEC   11:54:56 
    *0     15     *0     *0       0     6|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    51|0     1k     4k   100 shape  SEC   11:54:57 
    *0     16     *0     *0       0     9|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    51|0     1k    31k   100 shape  SEC   11:54:58 
    *0     31     *0     *0       0     6|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    49|0     2k     6k   101 shape  SEC   11:54:59 
 
 
iostat -x 2
Linux 3.0.0-1-amd64 (mongo)    09/27/2011      _x86_64_        (4 CPU)
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          68.81    0.00    0.21    1.08    0.00   29.90
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.20    80.21   10.43    3.28   378.63   357.27   107.37     2.00  145.60    9.84  577.20   2.63   3.60
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
         100.00    0.00    0.00    0.00    0.00    0.00
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
         100.00    0.00    0.00    0.00    0.00    0.00
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          99.88    0.00    0.12    0.00    0.00    0.00
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     2.50    0.00    1.50     0.00    16.00    21.33     0.00    0.00    0.00    0.00   0.00   0.00
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
         100.00    0.00    0.00    0.00    0.00    0.00
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
 
tail /var/log/mongodb/mongodb.log 
Tue Sep 27 11:56:20 [conn394] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 110ms
Tue Sep 27 11:56:21 [conn356] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 160ms
Tue Sep 27 11:56:21 [conn359] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:163 219ms
Tue Sep 27 11:56:21 [conn366] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 116ms
Tue Sep 27 11:56:21 [initandlisten] connection accepted from 10.0.2.1:58460 #452
Tue Sep 27 11:56:21 [initandlisten] connection accepted from 10.0.2.1:58461 #453
Tue Sep 27 11:56:21 [conn452] end connection 10.0.2.1:58460
Tue Sep 27 11:56:21 [conn417] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 192ms
Tue Sep 27 11:56:22 [conn366] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 166ms
Tue Sep 27 11:56:22 [conn426] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 648ms
 

I don't have hyperthreading or NUMA.
I graphed it with munin, and comparing periods where mongod is fine against periods where it's slow, nothing stands out:

  • iostat is stable: < 100 blocks read/s and < 20 blocks written/s
  • Btree stats do not vary
  • Current connections go from ~50 to ~800, which is expected since every query becomes very slow
  • mongod memory usage does not vary
  • MongoDB ops decrease (also expected, as every op takes much longer)
  • Write lock percentage is stable at around 0.02%
  • vmstat's running-process count goes from 0 to 800 while iowait stays at 0
  • When slow, the CPU graph looks like a barcode: sometimes nothing, sometimes 100% user

cat /proc/cpuinfo shows the following, four times:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           X3363  @ 2.83GHz
stepping        : 10
cpu MHz         : 2833.168
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 xsave lahf_lm dts tpr_shadow vnmi flexpriority
bogomips        : 5666.33
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:



 Comments   
Comment by Eliot Horowitz (Inactive) [ 17/Jan/13 ]

Yes, the feature you want I believe is SERVER-2212

Comment by Julien [ 27/Sep/11 ]

The priority of this ticket can be lowered now, as it's not a bug in the core.

Comment by Julien [ 27/Sep/11 ]

It was a sort that was eating CPU, so adding an index solved it!
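(The ticket doesn't say which field the queries sorted on, so `sortField` below is hypothetical; this is only a sketch of what the fix looks like in a 2.0-era shell, where `ensureIndex()` was the helper, later renamed `createIndex()`.)

```javascript
// Hypothetical field name -- the ticket does not name the sorted field.
db.mesh.ensureIndex({ sortField: 1 })

// With the index in place, explain() should no longer show scanAndOrder,
// and nscanned should drop to roughly nreturned:
db.mesh.find().sort({ sortField: 1 }).explain()
```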

The problem started when one query began sorting while another was already sorting, so both sorted slowly, raising the chance that a third request would arrive and start sorting too, and so on, until the load average reached 600 and the server was quasi-unresponsive.
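The feedback loop described above can be sketched numerically: if queries arrive at a fixed rate, each needs a fixed amount of CPU work, and the CPU is shared equally among in-flight queries, then once demand exceeds capacity the number of concurrent queries grows without bound. A toy model (Python; the rates and the 1000 ms/tick CPU budget are arbitrary assumptions, not measurements from this server):

```python
def simulate(arrival_rate=5, base_ms=300.0, ticks=30):
    """Toy model of the sort pile-up: each tick, `arrival_rate` new
    queries arrive needing `base_ms` of CPU work each; a budget of
    1000 ms of CPU per tick is shared equally among in-flight queries.
    Demand (5 * 300 = 1500 ms/tick) exceeds capacity, so the backlog grows."""
    running = []            # remaining work (ms) per in-flight query
    history = []
    for _ in range(ticks):
        running.extend([base_ms] * arrival_rate)
        if running:
            share = 1000.0 / len(running)   # CPU ms each query gets this tick
            running = [w - share for w in running if w - share > 0]
        history.append(len(running))
    return history

h = simulate()
print("in-flight queries over time:", h)
```

The in-flight count keeps climbing, mirroring the connection count going from ~50 to ~800 in the munin graphs.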

Would a parameter to limit a query's execution time help get out of situations like this?

Generated at Thu Feb 08 03:04:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.