[SERVER-5401] high CPU usage, low disk IO, very slow response Created: 26/Mar/12  Updated: 08/Mar/13  Resolved: 18/Oct/12

Status: Closed
Project: Core Server
Component/s: Performance
Affects Version/s: 2.0.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Nemanja Dubravac Assignee: Scott Hernandez (Inactive)
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS release 6.2, Linux bs-linux64.10gen.cc 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_41
24 GB RAM, Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz, 64bit with 8 cores


Attachments: PNG File MMS_server01_a.png     PNG File MMS_server01_b.png     PNG File Screen Shot 2012-03-26 at 13.39.21.png     PNG File event-2012-6-19-8-18amUTC.png    
Operating System: Linux
Participants:

 Description   

disk utilization is low, but CPU usage is almost 800% when finding documents (yes, we're using indexes on the query fields)
we do a lot of searches on indexed fields, and sometimes a lot of consecutive updates (sometimes on indexed fields) on documents, querying them directly by their _ids

when we do approx. 3500 queries/s we get around 600% CPU usage, even though I've heard that CPU usage for mongod should never be this high; the server load average stays low (around 0.05)
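
a rough sketch of the workload in mongo shell terms (collection and field names are taken from the currentOp dumps later in this ticket; the values shown are illustrative):

// indexed-field lookups that dominate the read load
db.products.find({ merged_ean: "0605279114863" })

// bursts of consecutive updates addressed directly by _id
db.products.update({ _id: ObjectId("4e32efa1e8a5c589d2815247") }, { $set: { solr_indexed: 1 } })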

 
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn                  set repl       time 
     0   3324      0      0       0       2       0  22.2g  26.4g  7.21g      0        0          0       0|0    13|0   384k   121k    65 my_replication    M   13:18:08 
     0   2401      0      0       0       1       0  22.2g  26.4g  7.21g      0        0          0       0|0     0|0   278k    88k    65 my_replication    M   13:18:09 
     0   3354      0      0       0       2       0  22.2g  26.4g  7.21g      0        0          0       0|0    13|0   387k   121k    65 my_replication    M   13:18:10 
     0   3089      0      0       0       1       0  22.2g  26.4g  7.21g      0        0          0       0|0    11|0   358k   112k    64 my_replication    M   13:18:11 
     0   3567      0      0       0       2       0  22.2g  26.4g  7.21g      0        0          0       0|0    10|0   414k   130k    64 my_replication    M   13:18:12 
     0   3603      0      0       0       1       0  22.2g  26.4g  7.21g      0        0          0       0|0    10|0   416k   131k    64 my_replication    M   13:18:13 
     0   3680      0      0       0       2       0  22.2g  26.4g  7.21g      0        0          0       0|0    11|0   424k   134k    64 my_replication    M   13:18:14 
     0   3637      0      0       0       1       0  22.2g  26.4g  7.21g      0        0          0       0|0     7|0   419k   132k    63 my_replication    M   13:18:15 
     0   2635      0      0       0       2       0  22.2g  26.4g  7.21g      0        0          0       0|0    12|0   302k    96k    63 my_replication    M   13:18:16 
     0   3659      0      0       0       1       0  22.2g  26.4g  7.21g      0        0          0       0|0     0|0   420k   133k    63 my_replication    M   13:18:17 
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn                  set repl       time 
     0   3694      0      0       0       2       0  22.2g  26.4g  7.21g      0        0          0       0|0     8|0   423k   134k    63 my_replication    M   13:18:18 
     0   2993      0      0       0       1       0  22.2g  26.4g  7.21g      0        0          0       0|0     4|0   343k   109k    63 my_replication    M   13:18:19 
     0   3516      0      0       0       2       0  22.2g  26.4g  7.21g      0        0          0       0|0    10|0   404k   128k    63 my_replication    M   13:18:20 
     0   3687      0      0       0       1       0  22.2g  26.4g  7.21g      0        0          0       0|0    10|0   423k   134k    63 my_replication    M   13:18:22 
     0   3629      0      0       0       2       0  22.2g  26.4g  7.21g      0        0          0       0|0     7|0   415k   132k    63 my_replication    M   13:18:23 
     0   3622      0      0       0       1       0  22.2g  26.4g  7.21g      0        0          0       0|0     9|0   414k   131k    63 my_replication    M   13:18:24 
     0   3113      0      0       0       2       0  22.2g  26.4g  7.21g      0        0          0       0|0    10|0   356k   113k    62 my_replication    M   13:18:25 
     0   3722      0      0       0       1       0  22.2g  26.4g  7.21g      0        0          0       0|0     9|0   428k   135k    62 my_replication    M   13:18:26 
     0   2870      0      0       0       2       0  22.2g  26.4g  7.21g      0        0          0       0|0     0|0   331k   105k    62 my_replication    M   13:18:27 
     0   3361      0      0       0      16       0  22.2g  26.4g  7.21g      0        0          0       0|0     5|0   386k   128k    62 my_replication    M   13:18:28 

if we then start updating while the queries are running, we get up to around 400 updates/s, but the queries drop to 300/s or even 0/s, as you can see from the attached MMS graph (the graph does not show the same timeframe as these logs, but the effect is the same)

insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn                  set repl       time 
     0      3    490      0       0     507       0  22.2g  26.4g  7.22g      0     20.7          0       0|0     0|0   152k    65k    51 my_replication    M   13:46:50 
     0      0    493      0       0     495       0  22.2g  26.4g  7.22g      0     21.2          0       0|0     0|0   151k    59k    51 my_replication    M   13:46:51 
     0      0    500      0       0     501       0  22.2g  26.4g  7.22g      0     20.9          0       0|0     0|0   153k    59k    51 my_replication    M   13:46:52 
     0      0    500      0       0     502       0  22.2g  26.4g  7.22g      0     21.2          0       0|0     0|0   153k    60k    51 my_replication    M   13:46:53 
     0      0    485      0       0     486       0  22.2g  26.4g  7.22g      0     20.9          0       0|0     0|0   149k    58k    51 my_replication    M   13:46:54 
     0      0    485      0       0     486       0  22.2g  26.4g  7.22g      0     20.3          0       0|0     0|0   149k    58k    51 my_replication    M   13:46:55 
     0      0    472      0       0     473       0  22.2g  26.4g  7.22g      0     20.2          0       0|0     0|0   145k    56k    51 my_replication    M   13:46:56 
     0      0    481      0       0     484       0  22.2g  26.4g  7.22g      0     20.7          0       0|0     0|0   148k    57k    51 my_replication    M   13:46:57 
     0      0    485      0       0     485       0  22.2g  26.4g  7.22g      0     20.2          0       0|0     0|0   148k    58k    51 my_replication    M   13:46:58 
     0      0    492      0       0     495       0  22.2g  26.4g  7.22g      0     20.6          0       0|0     0|0   151k    59k    51 my_replication    M   13:46:59 
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn                  set repl       time 
     0      0    500      0       0     501       0  22.2g  26.4g  7.22g      0       21          0       0|0     0|0   153k    59k    51 my_replication    M   13:47:00 
     0      0    332      0       0     334       0  22.2g  26.4g  7.22g      0       14          0       0|0     0|0   102k    40k    68 my_replication    M   13:47:01 
     0   2459    203      0       0     203       0  22.2g  26.4g  7.22g      0      6.3          0       3|1     5|1   347k   113k    68 my_replication    M   13:47:02 
     0   3253    216      0       0     218       0  22.2g  26.4g  7.22g      0      5.7          0       2|1     8|1   445k   144k    67 my_replication    M   13:47:03 
     0   2166    142      0       0     143       0  22.2g  26.4g  7.22g      0      3.6          0       0|1    13|1   292k    95k    65 my_replication    M   13:47:04 
     0   3232    207      0       0     210       0  22.2g  26.4g  7.22g      0      5.4          0       0|0     2|0   442k   142k    65 my_replication    M   13:47:05 
     0   3318    200      0       0     201       0  22.2g  26.4g  7.22g      0      5.4          0       0|0    10|0   445k   143k    65 my_replication    M   13:47:06 
     0   3096    250      0       0     252       0  22.2g  26.4g  7.22g      0      6.6          0       2|0     2|1   438k   142k    64 my_replication    M   13:47:07 
     0  91650    540      0       0     555       0  22.2g  26.4g  7.22g      0      6.1          0       0|1    12|1    10m     3m    64 my_replication    M   13:47:50 
     0      3      0      0       0      15       0  22.2g  26.4g  7.22g      0      0.1          0       0|0     0|0     1k     8k    50 my_replication    M   13:47:51 
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn                  set repl       time 
     0      0      0      0       0       2       0  22.2g  26.4g  7.22g      0        0          0       0|0     0|0   215b     1k    50 my_replication    M   13:47:52 
     0      0      0      0       0       1       0  22.2g  26.4g  7.22g      0        0          0       0|0     0|0    62b     1k    50 my_replication    M   13:47:53 
     0      0      0      0       0       2       0  22.2g  26.4g  7.22g      0        0          0       0|0     0|0   215b     1k    50 my_replication    M   13:47:54

and for a short time the HDD utilization goes up and the server load average increases to around 2.0:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.16    0.00     0.01     0.00   130.83     0.00   10.00   0.52   0.01
sda               0.11    13.45    0.88   14.86     0.06     0.11    22.25     6.62  420.27   4.14   6.51
md1               0.00     0.00    0.00    0.00     0.00     0.00     7.74     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     8.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.82   27.96     0.05     0.11    11.32     0.00    0.00   0.00   0.00
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda               0.00   226.00    0.00  303.00     0.00     2.39    16.16    47.37  284.78   3.30 100.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00  426.00     0.00     1.66     7.96     0.00    0.00   0.00   0.00
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda               0.00   546.00    0.00  301.00     0.00     3.22    21.93    75.70  221.28   3.32 100.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00  864.00     0.00     3.37     7.99     0.00    0.00   0.00   0.00
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda               0.00   370.00    0.00  213.00     0.00     2.32    22.27    23.09  151.24   4.48  95.40
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00  565.00     0.00     2.17     7.86     0.00    0.00   0.00   0.00
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda               0.00     3.00    0.00    7.00     0.00     0.04    11.43     0.10   14.57  14.29  10.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    8.00     0.00     0.03     8.00     0.00    0.00   0.00   0.00
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda               0.00     0.00    0.00    1.00     0.00     0.00     8.00     0.03   31.00  31.00   3.10
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

this is the serverStatus() output:

db.serverStatus()
{
	"host" : "###SERVER1###",
	"version" : "2.0.2",
	"process" : "mongod",
	"uptime" : 243038,
	"uptimeEstimate" : 238124,
	"localTime" : ISODate("2012-03-26T11:56:12.316Z"),
	"globalLock" : {
		"totalTime" : 243037830469,
		"lockTime" : 1148740105,
		"ratio" : 0.0047265896950414235,
		"currentQueue" : {
			"total" : 0,
			"readers" : 0,
			"writers" : 0
		},
		"activeClients" : {
			"total" : 7,
			"readers" : 7,
			"writers" : 0
		}
	},
	"mem" : {
		"bits" : 64,
		"resident" : 7390,
		"virtual" : 27029,
		"supported" : true,
		"mapped" : 22758,
		"note" : "virtual minus mapped is large. could indicate a memory leak"
	},
	"connections" : {
		"current" : 63,
		"available" : 9537
	},
	"extra_info" : {
		"note" : "fields vary by platform",
		"heap_usage_bytes" : 375216,
		"page_faults" : 69745
	},
	"indexCounters" : {
		"btree" : {
			"accesses" : 160738,
			"hits" : 160723,
			"misses" : 15,
			"resets" : 0,
			"missRatio" : 0.00009331956351329492
		}
	},
	"backgroundFlushing" : {
		"flushes" : 4037,
		"total_ms" : 8630655,
		"average_ms" : 2137.8882833787466,
		"last_ms" : 7,
		"last_finished" : ISODate("2012-03-26T11:56:06.277Z")
	},
	"cursors" : {
		"totalOpen" : 1,
		"clientCursors_size" : 1,
		"timedOut" : 0,
		"totalNoTimeout" : 1
	},
	"network" : {
		"bytesIn" : 65203610293,
		"bytesOut" : 71520001080,
		"numRequests" : 548532387
	},
	"repl" : {
		"setName" : "my_replication",
		"ismaster" : true,
		"secondary" : false,
		"hosts" : [
			"###SERVER1###"
		],
		"arbiters" : [
			"###SERVER2###"
		],
		"primary" : "###SERVER1###",
		"me" : "###SERVER1###"
	},
	"opcounters" : {
		"insert" : 8,
		"query" : 541326108,
		"update" : 3466412,
		"delete" : 0,
		"getmore" : 12648,
		"command" : 3727281
	},
	"asserts" : {
		"regular" : 0,
		"warning" : 0,
		"msg" : 0,
		"user" : 572,
		"rollovers" : 0
	},
	"writeBacksQueued" : false,
	"ok" : 1
}

and these are the collection stats:

db.fdy_products.stats()
{
	"ns" : "###my-db###.###my-collection###",
	"count" : 848036,
	"size" : 7271212740,
	"avgObjSize" : 8574.17932729271,
	"storageSize" : 9138339840,
	"numExtents" : 37,
	"nindexes" : 9,
	"lastExtentSize" : 1526575104,
	"paddingFactor" : 1.0099999999512375,
	"flags" : 1,
	"totalIndexSize" : 672054896,
	"indexSizes" : {
		"_id_" : 29507184,
		"_index_1_" : 52445184,
		"_matching_1_" : 64249856,
		"_matching_2_" : 81567744,
		"_index_2_" : 57614336,
		"_index_3_" : 72572928,
		"_index_4" : 156557312,
		"_index_5" : 120020992,
		"_index_6" : 37519360
	},
	"ok" : 1
}

during such parallel reads/updates, almost every day we get at least one timeout (after 30 s), even when using findOne({ _id: ObjectId('...') }), which in my opinion should never happen

we tried both ways of disabling NUMA:

  • just for mongod by adding "numactl --interleave=all " before the mongod command (verified that interleave is being used by checking /proc/#/numa_maps)
  • for the whole OS, by adding numa=off to boot settings (verified by checking the messages log, where it says: "kernel: NUMA turned off")

we're accessing the DB through the PHP (5.3.3) driver (mongo/1.2.7), but I don't think it has anything to do with this issue



 Comments   
Comment by Scott Hernandez (Inactive) [ 18/Oct/12 ]

I see you are running with 2.2.0-rc0; can you upgrade to 2.2.0 and see if the problem still occurs? A change was made to heap memory allocation in rc2 which may fix the cause of the high CPU you have seen.

Comment by Eliot Horowitz (Inactive) [ 27/Jun/12 ]

You might want to try lowering the syncdelay.
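
For example, a minimal sketch of lowering it at runtime (the value shown is only illustrative; the default flush interval is 60 seconds):

// reduce the interval between background flushes so each flush writes less dirty data
db.adminCommand({ setParameter: 1, syncdelay: 30 })

The same setting can also be given at mongod startup via --syncdelay.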

Comment by Nemanja Dubravac [ 27/Jun/12 ]

there was another 30 s timeout today at 10:01:48, and I saw the background flush you mentioned, so it's most probably the HDD writes that are causing these timeouts
is there a way to fix it other than getting better drives?

Comment by Nemanja Dubravac [ 26/Jun/12 ]

another timeout after 30 seconds happened today, 26 June 2012, at 14:01:31 UTC
this was the query:

{_id: {$in: [ #2 mongo object IDs# ]}}

and these are the current OPs at the time:

array (
    0 => 
    array (
      'opid' => 1116100861,
      'active' => false,
      'waitingForLock' => false,
      'op' => 'query',
      'ns' => '',
      'query' => 
      array (
      ),
      'client' => '78.##.###.##:42937',
      'desc' => 'conn',
      'threadId' => '0x7f84fc1fc700',
      'connectionId' => 1154091,
      'numYields' => 0,
    ),
    1 => 
    array (
      'opid' => 1116100863,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => '0605279114863',
      ),
      'client' => '78.##.###.##:42932',
      'desc' => 'conn',
      'threadId' => '0x7f88104ff700',
      'connectionId' => 1154088,
      'numYields' => 0,
    ),
    2 => 
    array (
      'opid' => 1116100851,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => '9781400106639',
      ),
      'client' => '176.#.###.##:39866',
      'desc' => 'conn',
      'threadId' => '0x7f84ed5d5700',
      'connectionId' => 1154071,
      'numYields' => 0,
    ),
    3 => 
    array (
      'opid' => 1116100837,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'supplier_s' => 'pce',
        'merged_manufacturer_id' => '629040',
      ),
      'client' => '176.#.###.##:39853',
      'desc' => 'conn',
      'threadId' => '0x7f84ee6e6700',
      'connectionId' => 1154070,
      'numYields' => 0,
    ),
    4 => 
    array (
      'opid' => 1116100811,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'supplier_s' => 'graupner',
        'merged_manufacturer_id' => '265318',
      ),
      'client' => '176.#.###.##:39835',
      'desc' => 'conn',
      'threadId' => '0x7f84eeaea700',
      'connectionId' => 1154063,
      'numYields' => 0,
    ),
    5 => 
    array (
      'opid' => 1116100859,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => '4260087123054',
      ),
      'client' => '176.#.###.##:39829',
      'desc' => 'conn',
      'threadId' => '0x7f84ee3e3700',
      'connectionId' => 1154062,
      'numYields' => 0,
    ),
    6 => 
    array (
      'opid' => 1116100858,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => '5992911416627',
      ),
      'client' => '78.##.###.##:42929',
      'desc' => 'conn',
      'threadId' => '0x7f8ab81cd700',
      'connectionId' => 1154085,
      'numYields' => 0,
    ),
    7 => 
    array (
      'opid' => 1116100848,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => '4260060271093',
      ),
      'client' => '176.#.###.##:39869',
      'desc' => 'conn',
      'threadId' => '0x7f84529f2700',
      'connectionId' => 1154075,
      'numYields' => 0,
    ),
    8 => 
    array (
      'opid' => 1116100854,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => '4260120086117',
      ),
      'client' => '78.##.###.##:42873',
      'desc' => 'conn',
      'threadId' => '0x7f84ec2c2700',
      'connectionId' => 1154079,
      'numYields' => 0,
    ),
    9 => 
    array (
      'opid' => 1116100835,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => '9780007169214',
      ),
      'client' => '78.##.###.##:42924',
      'desc' => 'conn',
      'threadId' => '0x7f8ab3afa700',
      'connectionId' => 1154081,
      'numYields' => 0,
    ),
    10 => 
    array (
      'opid' => 1116100853,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'supplier_s' => 'digitus',
        'merged_manufacturer_id' => 'dn-19tray-1-1000-sw',
      ),
      'client' => '176.#.###.##:39820',
      'desc' => 'conn',
      'threadId' => '0x7f84525ee700',
      'connectionId' => 1154058,
      'numYields' => 0,
    ),
    11 => 
    array (
      'opid' => 1116100867,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => '9780553822236',
      ),
      'client' => '78.##.###.##:42934',
      'desc' => 'conn',
      'threadId' => '0x7f84efdfd700',
      'connectionId' => 1154092,
      'numYields' => 0,
    ),
    12 => 
    array (
      'opid' => 1116100871,
      'active' => false,
      'waitingForLock' => false,
      'op' => 'query',
      'ns' => '',
      'query' => 
      array (
      ),
      'client' => '176.#.###.##:39876',
      'desc' => 'conn',
      'threadId' => '0x7f8ab38f8700',
      'connectionId' => 1154094,
      'numYields' => 0,
    ),
    13 => 
    array (
      'opid' => 1116100844,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => '9784007002120',
      ),
      'client' => '78.##.###.##:42935',
      'desc' => 'conn',
      'threadId' => '0x7f84fc3fe700',
      'connectionId' => 1154090,
      'numYields' => 0,
    ),
    14 => 
    array (
      'opid' => 1116100873,
      'active' => false,
      'lockType' => 'read',
      'waitingForLock' => false,
      'op' => 'query',
      'ns' => '',
      'query' => 
      array (
        'merged_ean' => '9783802404016',
      ),
      'client' => '78.##.###.##:42926',
      'desc' => 'conn',
      'threadId' => '0x7f88101fc700',
      'connectionId' => 1154082,
      'numYields' => 0,
    ),
    15 => 
    array (
      'opid' => 1116100856,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'supplier_s' => 'hygiene-shop.eu',
        'merged_manufacturer_id' => 'in0000155',
      ),
      'client' => '78.##.###.##:42927',
      'desc' => 'conn',
      'threadId' => '0x7f84ee5e5700',
      'connectionId' => 1154084,
      'numYields' => 0,
    ),
    16 => 
    array (
      'opid' => 1116100865,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => '4052278489908',
      ),
      'client' => '78.##.###.##:42928',
      'desc' => 'conn',
      'threadId' => '0x7f8ab21fc700',
      'connectionId' => 1154086,
      'numYields' => 0,
    ),
    17 => 
    array (
      'opid' => 1116100855,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => '4008224528030',
      ),
      'client' => '78.##.###.##:42930',
      'desc' => 'conn',
      'threadId' => '0x7f8ab3efe700',
      'connectionId' => 1154093,
      'numYields' => 0,
    ),
    18 => 
    array (
      'opid' => 1116100850,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => 
        array (
          '$in' => 
          array (
            0 => '3425160177862',
            1 => '3425160265736',
          ),
        ),
      ),
      'client' => '176.#.###.##:39808',
      'desc' => 'conn',
      'threadId' => '0x7f83d1ce4700',
      'connectionId' => 1154054,
      'numYields' => 0,
    ),
    19 => 
    array (
      'opid' => 1116100852,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => '5052836920378',
      ),
      'client' => '78.##.###.##:42806',
      'desc' => 'conn',
      'threadId' => '0x7f84ec7c7700',
      'connectionId' => 1153967,
      'numYields' => 0,
    ),
    20 => 
    array (
      'opid' => 1116100819,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => '85126589615',
      ),
      'client' => '176.#.###.##:39868',
      'desc' => 'conn',
      'threadId' => '0x7f84ef2f2700',
      'connectionId' => 1154073,
      'numYields' => 0,
    ),
  )

Comment by Nemanja Dubravac [ 25/Jun/12 ]

there are periods where the mms-agent starts using 100% CPU (it's running on another server) and then stops logging mongostat/iostat (sometimes only iostat) for a few minutes, which really makes it hard to analyze the timeouts
it doesn't output any errors in its log file, it just restarts itself randomly after these periods of high CPU usage
I guess I'll have to file a separate bug for this

for example, at Jun 25 2012 11:47:53 UTC there was a timeout after 2 seconds while performing find({ _id: { $in: [ #array of 83 IDs# ] } })

this query asks for only a small subset of fields, so the returned amount of data shouldn't be too big:

  • MongoObject ID
  • array with 1-2 possible int values
  • string up to 50 chars
  • string up to 100 chars
  • small multidimensional array with a few ints and strings
  • small multidimensional array with a few strings
  • int
  • small array with a few ints

by small, I mean an array with 3-5 keys

this is the currentOp output:

  array (
    0 => 
    array (
      'opid' => 143077327,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => 
        array (
          '$in' => 
          array (
            0 => '3243480014830',
          ),
        ),
      ),
      'client' => '176.#.###.##:52432',
      'desc' => 'conn',
      'threadId' => '0x7f84ed6d6700',
      'connectionId' => 942255,
      'numYields' => 0,
    ),
    1 => 
    array (
      'opid' => 143077325,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'supplier_s' => 'sdu',
        'merged_manufacturer_id' => '7711804',
      ),
      'client' => '78.##.###.##:58195',
      'desc' => 'conn',
      'threadId' => '0x7f8ab3afa700',
      'connectionId' => 942203,
      'numYields' => 0,
    ),
    2 => 
    array (
      'opid' => 143077328,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'supplier_s' => 'ditech',
        'merged_manufacturer_id' => '0pix',
      ),
      'client' => '176.#.###.##:52435',
      'desc' => 'conn',
      'threadId' => '0x7f84ed5d5700',
      'connectionId' => 942256,
      'numYields' => 0,
    ),
    3 => 
    array (
      'opid' => 143077330,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => 
        array (
          '$in' => 
          array (
            0 => '4260245283040',
          ),
        ),
      ),
      'client' => '176.#.###.##:52441',
      'desc' => 'conn',
      'threadId' => '0x7f84efbfb700',
      'connectionId' => 942258,
      'numYields' => 0,
    ),
    5 => 
    array (
      'opid' => 143077320,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => '4332163402527',
      ),
      'client' => '78.##.###.##:58204',
      'desc' => 'conn',
      'threadId' => '0x7f84ec3c3700',
      'connectionId' => 942205,
      'numYields' => 0,
    ),
    6 => 
    array (
      'opid' => 143077326,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => 
        array (
          '$in' => 
          array (
            0 => '4016032171713',
          ),
        ),
      ),
      'client' => '176.#.###.##:52426',
      'desc' => 'conn',
      'threadId' => '0x7f84edada700',
      'connectionId' => 942253,
      'numYields' => 0,
    ),
    7 => 
    array (
      'opid' => 143077324,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => '4004182237823',
      ),
      'client' => '78.##.###.##:58115',
      'desc' => 'conn',
      'threadId' => '0x7f84ee3e3700',
      'connectionId' => 942189,
      'numYields' => 0,
    ),
    8 => 
    array (
      'opid' => 143077331,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        '$msg' => 'query not recording (too large)',
      ),
      'client' => '46.#.###.###:53181',
      'desc' => 'conn',
      'threadId' => '0x7f84ee5e5700',
      'connectionId' => 942261,
      'numYields' => 0,
    ),
  )

a timeout with a query like this also occurred at 25 June 2012 12:01:29, while these were the current OPs:

array (
    0 => 
    array (
      'opid' => 151462272,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => '4024488092346',
      ),
      'client' => '78.##.###.##:60242',
      'desc' => 'conn',
      'threadId' => '0x7f84ef7f7700',
      'connectionId' => 944104,
      'numYields' => 0,
    ),
    1 => 
    array (
      'opid' => 151462265,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'supplier_s' => 'emsa',
        'merged_manufacturer_id' => '8519100150',
      ),
      'client' => '78.##.###.##:60224',
      'desc' => 'conn',
      'threadId' => '0x7f84ef4f4700',
      'connectionId' => 944098,
      'numYields' => 0,
    ),
    2 => 
    array (
      'opid' => 151462277,
      'active' => false,
      'waitingForLock' => false,
      'op' => 'query',
      'ns' => '',
      'query' => 
      array (
      ),
      'client' => '78.##.###.##:60140',
      'desc' => 'conn',
      'threadId' => '0x7f8ab3bfb700',
      'connectionId' => 943979,
      'numYields' => 0,
    ),
    3 => 
    array (
      'opid' => 151462275,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'supplier_s' => 'ditech',
        'merged_manufacturer_id' => '0pix',
      ),
      'client' => '176.#.###.##:56527',
      'desc' => 'conn',
      'threadId' => '0x7f8ab3efe700',
      'connectionId' => 944121,
      'numYields' => 0,
    ),
    4 => 
    array (
      'opid' => 151462270,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        '$msg' => 'query not recording (too large)',
      ),
      'client' => '46.#.###.###:54003',
      'desc' => 'conn',
      'threadId' => '0x7f84ee6e6700',
      'connectionId' => 944116,
      'numYields' => 0,
    ),
    5 => 
    array (
      'opid' => 151462271,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => 
        array (
          '$in' => 
          array (
            0 => '3243480014830',
          ),
        ),
      ),
      'client' => '176.#.###.##:56518',
      'desc' => 'conn',
      'threadId' => '0x7f84ec9c9700',
      'connectionId' => 944118,
      'numYields' => 0,
    ),
    6 => 
    array (
      'opid' => 151462260,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'supplier_s' => 'dr. winkler',
        'merged_manufacturer_id' => '560',
      ),
      'client' => '78.##.###.##:60196',
      'desc' => 'conn',
      'threadId' => '0x7f8ab38f8700',
      'connectionId' => 944091,
      'numYields' => 0,
    ),
    7 => 
    array (
      'opid' => 151462261,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'supplier_s' => 'römer',
        'merged_manufacturer_id' => '2000002815',
      ),
      'client' => '78.##.###.##:60161',
      'desc' => 'conn',
      'threadId' => '0x7f84ecece700',
      'connectionId' => 944000,
      'numYields' => 0,
    ),
    8 => 
    array (
      'opid' => 151462276,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'supplier_s' => 'walkntalkonline',
        'merged_manufacturer_id' => 'samsungi9100galaxys2',
      ),
      'client' => '78.##.###.##:60166',
      'desc' => 'conn',
      'threadId' => '0x7f84ec7c7700',
      'connectionId' => 944001,
      'numYields' => 0,
    ),
    9 => 
    array (
      'opid' => 151462274,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'supplier_s' => 'metaltex',
        'merged_manufacturer_id' => '252814010',
      ),
      'client' => '78.##.###.##:60227',
      'desc' => 'conn',
      'threadId' => '0x7f84ec1c1700',
      'connectionId' => 944099,
      'numYields' => 0,
    ),
    10 => 
    array (
      'opid' => 151462264,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'supplier_s' => 'selfsat',
        'merged_manufacturer_id' => 'hh-90',
      ),
      'client' => '78.##.###.##:60205',
      'desc' => 'conn',
      'threadId' => '0x7f88101fc700',
      'connectionId' => 944093,
      'numYields' => 0,
    ),
    11 => 
    array (
      'opid' => 151462266,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => '0886489256639',
      ),
      'client' => '78.##.###.##:60177',
      'desc' => 'conn',
      'threadId' => '0x7f83d24ec700',
      'connectionId' => 944004,
      'numYields' => 0,
    ),
    12 => 
    array (
      'opid' => 151462262,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'supplier_s' => 'heyda',
        'merged_manufacturer_id' => '204716252',
      ),
      'client' => '78.##.###.##:60229',
      'desc' => 'conn',
      'threadId' => '0x7f84efafa700',
      'connectionId' => 944100,
      'numYields' => 0,
    ),
    13 => 
    array (
      'opid' => 151462268,
      'active' => true,
      'lockType' => 'read',
      'waitingForLock' => false,
      'secs_running' => 0,
      'op' => 'query',
      'ns' => '###-production.###_products',
      'query' => 
      array (
        'merged_ean' => '5053072968506',
      ),
      'client' => '78.##.###.##:60156',
      'desc' => 'conn',
      'threadId' => '0x7f84edada700',
      'connectionId' => 943994,
      'numYields' => 0,
    ),
  )

Comment by Nemanja Dubravac [ 19/Jun/12 ]

I have added the currentOp query on the timeout exception, so now we wait
we can leave the issue public for now

Comment by Scott Hernandez (Inactive) [ 19/Jun/12 ]

I've attached an MMS capture of the relevant minutes, and it looks like a disk event related to background flushing and page faults slowed things down for those queries. The currentOp might help narrow things down a bit once we get that.

There is a similar and worse event at ~8:47; do you get slowdowns/timeouts then too?

It is also worth noting that the CPU I/O wait graphs support this as well, showing higher I/O wait at the same time.

Please let us know if you want to make this issue private as we are now asking for real queries from your system (the currentOp info).

Comment by Scott Hernandez (Inactive) [ 19/Jun/12 ]

Looking at MMS now; when you get these timeouts, can you dump the output of db.currentOp()? This will help us see what else is going on at that moment.

If you want to put this in your application, all you need to do is a findOne on the "$cmd.sys.inprog" collection. If you tell me what language you are using I can help you craft that query.
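
For reference, a minimal mongo shell sketch (db.currentOp() in the shell is just a findOne on this virtual collection, so a driver can issue the same query):

// equivalent to db.currentOp() in the shell
db.$cmd.sys.inprog.findOne()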

Comment by Nemanja Dubravac [ 19/Jun/12 ]

we still get occasional timeouts
for example, at Jun 18 2012 08:17:58 UTC:

{_id: {$in: [ #array of 86 IDs# ] }}

timed out after 4 attempts, with timeouts of 1, 2, 3 and 4 seconds respectively

and this one happened at Jun 18 2012 07:47:47 UTC:

{_id: ObjectId('#random ID#')}

timed out after 30 seconds

Comment by Nemanja Dubravac [ 18/Jun/12 ]

I have just upgraded the server to 2.0.6 and compacted the collection 30 minutes ago

Comment by Nemanja Dubravac [ 18/Jun/12 ]

I just finished optimizing one of our most frequently executed queries, so that we no longer use a low-cardinality field ('removed') in the query
these queries were never logged by the profiler, because they never took more than 100 ms, but since there are so many of them, they add up

before we used this:

{ean: '#random ean string#', removed: {$ne: 1}}

{brand_s: '#random brand string#', manufacturer_id_s: '#random mpn string#', removed: {$ne: 1}}

since the 'removed' field can only have these values: 0, 1, 2, 3, 4, I realized it would be better to move that check to the application
so now we're querying for

{ean: '#random ean number#'}

and

{brand_s: '#random brand string#', manufacturer_id_s: '#random mpn string#'}

only
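
in mongo shell terms, a sketch of the change (collection name assumed; placeholders as in the snippets above):

// before: low-cardinality 'removed' filter included in the query
db.products.find({ ean: '#random ean string#', removed: { $ne: 1 } })

// after: query by the selective key only and check 'removed' in the application
db.products.find({ ean: '#random ean string#' })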

since this change (half an hour ago) the number of queries/s has doubled from 3000 to 6000, but we're still seeing very high CPU usage (~700%)

I still have to wait and see whether a random access by _id occasionally times out as it did before

just for reference in MMS, I'm talking about server01 the whole time

Comment by Scott Hernandez (Inactive) [ 18/Jun/12 ]

Is this still happening? Now that we have MMS+Munin being collected it would be helpful if you can point out some times (in UTC) when the slowdown is happening.

Comment by Nemanja Dubravac [ 18/Jun/12 ]

unfortunately, it doesn't change a thing for us :/
any other suggestions?

Comment by Dumitru Alin Silvian [ 15/Jun/12 ]

Apparently restarting mongod with interleave=all fixed the issue (although /proc/sys/vm/zone_reclaim_mode was 0). The behavior has not repeated itself within the last week.

Comment by Nemanja Dubravac [ 11/Jun/12 ]

almost 2 weeks without a reply...

could it be some CentOS-specific thing? (NUMA is disabled, as mentioned above)
what else should I test?

Comment by Nemanja Dubravac [ 31/May/12 ]

insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn                  set repl       time 
     0   9019      5      0       0       2       0  22.2g  26.7g  16.8g      0      0.1          0       0|0     1|0     1m   342k    42 ###_replication    M   10:56:23 
     0   9387      2      0       0       1       0  22.2g  26.7g  16.8g      0        0          0       0|0     7|0     1m   355k    40 ###_replication    M   10:56:24 
     0   9310      2      0       0       2       0  22.2g  26.7g  16.8g      0        0          0       0|0     5|0     1m   376k    40 ###_replication    M   10:56:25 
     0   9169      5      0       0       1       0  22.2g  26.7g  16.8g      1      0.1          0       0|0     8|0     1m   357k    41 ###_replication    M   10:56:26 
     0   8882      5      0       0       2       0  22.2g  26.7g  16.8g      0      0.1          0       0|0     7|0     1m   332k    41 ###_replication    M   10:56:27 
     0   9412      4      0       0       1       0  22.2g  26.7g  16.8g      0      0.1          0       0|0     4|0     1m   352k    39 ###_replication    M   10:56:28 
     0  10526      2      0       0       2       0  22.2g  26.7g  16.8g      0        0          0       0|0     4|0     1m   414k    39 ###_replication    M   10:56:29 
     0   9132      6      0       0       1       0  22.2g  26.7g  16.8g      0      0.1          0       0|0     8|0     1m   347k    38 ###_replication    M   10:56:30 
     0   9375      4      0       0       3       0  22.2g  26.7g  16.8g      0      0.1          0       0|0     6|0     1m   354k    38 ###_replication    M   10:56:31 
     0  10090      6      0       0       1       0  22.2g  26.7g  16.8g      0      0.1          0       0|0     1|0     1m   373k    38 ###_replication    M   10:56:32 
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn                  set repl       time 
     0  10279      4      0       0      12       0  22.2g  26.7g  16.8g      0      0.1          0       0|0     2|0     1m   791k    38 ###_replication    M   10:56:33 
     0  10563      5      0       0       1       0  22.2g  26.7g  16.8g      0      0.1          0       0|0     4|0     1m   393k    37 ###_replication    M   10:56:34 
     0  10930      2      0       0       2       0  22.2g  26.7g  16.8g      0        0          0       0|0     5|0     1m   413k    38 ###_replication    M   10:56:35 
     0  10285      4      0       0       1       0  22.2g  26.7g  16.8g      0        0          0       0|0     8|0     1m   400k    37 ###_replication    M   10:56:36 
     0  10670      4      0       0       2       0  22.2g  26.7g  16.8g      0      0.1          0       0|0     2|0     1m   400k    37 ###_replication    M   10:56:37 

there are many other queries running on the server, as you can see from the query count, but none of them get reported in the profiling collection (they are all using indexes)

and this is the query explain; it was run at 10:56:30

db.products.find({solr_indexed: 0}).explain()
{
	"cursor" : "BtreeCursor _solr_indexed_",
	"nscanned" : 1796,
	"nscannedObjects" : 1796,
	"n" : 1796,
	"millis" : 16,
	"nYields" : 0,
	"nChunkSkips" : 0,
	"isMultiKey" : false,
	"indexOnly" : false,
	"indexBounds" : {
		"solr_indexed" : [
			[
				0,
				0
			]
		]
	}
}

there is one slow query being reported in the profiling, but I think the speed is reasonable, since it's a big update:

{
   "ts": ISODate("2012-05-31T08: 20: 36.749Z"),
   "op": "update",
   "ns": "###-production.products",
   "query": {
     "_id": {
       "$in": {
         "0": ObjectId("4ed6143c8c40045aa7483fbf"),
         "1": ObjectId("4ed613e48c40045aa7483f42"),
         "2": ObjectId("4ed6140c8c40045aa7483f64"),
         "3": ObjectId("4e83dd228c40045aa747435b"),
         "4": ObjectId("4ed614098c40045aa7483f52"),
         "5": ObjectId("4ed6140a8c40045aa7483f53"),
         "6": ObjectId("4ed614658c40045aa7483fe2"),
         "7": ObjectId("4ed614658c40045aa7483fe4"),
 
... up to 1000 object IDs
 
      } 
    } 
  },
   "updateobj": {
     "$set": {
       "solr_indexed": 1 
    } 
  },
   "keyUpdates": 1,
   "millis": 211,
   "client": "###",
   "user": "" 
}
 

Comment by Scott Hernandez (Inactive) [ 29/May/12 ]

Dumitru, can you create a new issue with your stats (cpu/mongostat/iostat/etc)?

Nemanja, it looks like you are returning lots of documents and that takes a while. Can you post mongostat numbers when you run the explain again? Please note the start/end time.

Comment by Dumitru Alin Silvian [ 29/May/12 ]

I have a similar issue: high CPU usage (2200% on 24 cores) with load < 1 on 2.6.32-220.4.1.el6.x86_64.
Cpu(s): 5.7%us, 81.6%sy, 0.0%ni, 12.2%id, 0.0%wa, 0.0%hi, 0.5%si, 0.0%st
26483 mongo 20 0 51.4g 6.8g 6.2g S 2082.4 3.6 20833:13 mongod
It seems strange that 81% of all CPU is spent in system time on behalf of mongod.

Another thing: I also make queries with "$in", in a replica set. However, when I rs.stepDown() the problem server and the other one becomes primary, this behavior stops, and for the same amount of queries I get less than 100% usage. Both servers are configured the same way and have similar hardware. SERVER-3497 (https://jira.mongodb.org/browse/SERVER-3497) is not the issue.

Comment by Nemanja Dubravac [ 29/May/12 ]

> db.products.find({solr_indexed: 0}).explain()
{
	"cursor" : "BtreeCursor _solr_indexed_",
	"nscanned" : 24804,
	"nscannedObjects" : 24804,
	"n" : 24804,
	"millis" : 399,
	"nYields" : 1,
	"nChunkSkips" : 0,
	"isMultiKey" : false,
	"indexOnly" : false,
	"indexBounds" : {
		"solr_indexed" : [
			[
				0,
				0
			]
		]
	}
}

there are 893588 documents in total in this collection

I really don't think this is the query that's causing the random slowdowns, because it isn't run that often

Comment by Scott Hernandez (Inactive) [ 29/May/12 ]

It looks like the query may be different (the results clearly are); can you please post the query along with the explain?

Comment by Nemanja Dubravac [ 29/May/12 ]

there is an index on this solr_indexed field,
and it is using that index when I perform the same query manually:

{
   "cursor": "BtreeCursor _solr_indexed_",
   "nscanned": 5551,
   "nscannedObjects": 5551,
   "n": 5551,
   "scanAndOrder": true,
   "millis": 756,
   "nYields": 11,
   "nChunkSkips": 0,
   "isMultiKey": false,
   "indexOnly": false,
   "indexBounds": {
     "solr_indexed": [
       [
         0,
         0 
      ] 
    ] 
  },
   "allPlans": [
     {
       "cursor": "BtreeCursor _id_ reverse",
       "indexBounds": {
         "_id": [
           [
             {
               "$maxElement": 1 
            },
             {
               "$minElement": 1 
            } 
          ] 
        ] 
      } 
    },
     {
       "cursor": "BtreeCursor _solr_indexed_",
       "indexBounds": {
         "solr_indexed": [
           [
             0,
             0 
          ] 
        ] 
      } 
    },
     {
       "cursor": "BasicCursor",
       "indexBounds": [
         
      ] 
    } 
  ],
   "oldPlan": {
     "cursor": "BtreeCursor _solr_indexed_",
     "indexBounds": {
       "solr_indexed": [
         [
           0,
           0 
        ] 
      ] 
    } 
  } 
}

Comment by Eliot Horowitz (Inactive) [ 26/May/12 ]

This looks like it's not using an index:

{
   "ts": ISODate("2012-05-01T11: 18: 49.664Z"),
   "op": "query",
   "ns": "###-production.products",
   "query": {
     "query": {
       "solr_indexed": 0 
    },
     "$explain": true 
  },
   "nscanned": 28173,
   "nreturned": 1,
   "responseLength": 489,
   "millis": 1366,
   "client": "127.0.0.1",
   "user": "" 
}

Comment by Nemanja Dubravac [ 01/May/12 ]

No, as far as I know, all our queries are using indexes.
I started logging slow queries, and here are a few examples:

{
   "ts": ISODate("2012-05-01T10: 29: 09.885Z"),
   "op": "update",
   "ns": "###-production.products",
   "query": {
     "_id": {
       "$in": {
         "0": ObjectId("4e32e402e8a5c589d27a0827"),
         "1": ObjectId("4ea753cc8c40045aa7477522"),
         "2": ObjectId("4e32e406e8a5c589d27a0b10"),
         "3": ObjectId("4e32e873e8a5c589d27d0604"),
         "4": ObjectId("4ea753d78c40045aa7477570"),
         "5": ObjectId("4e32e42ce8a5c589d27a2616"),
         "6": ObjectId("4e32e42ce8a5c589d27a2621"),
         "7": ObjectId("4e32e42ce8a5c589d27a2651"),
         "8": ObjectId("4e32e42ce8a5c589d27a266c"),
         "9": ObjectId("4e32e42de8a5c589d27a26cf"),
         ...
         ...
         "998": ObjectId("4e32e477e8a5c589d27a58c3"),
         "999": ObjectId("4e32e5cde8a5c589d27b5e46") 
      } 
    } 
  },
   "updateobj": {
     "$set": {
       "solr_indexed": 1 
    } 
  },
   "millis": 8491,
   "client": "46.4.119.174",
   "user": "" 
}
 
{
   "ts": ISODate("2012-05-01T09: 20: 38.418Z"),
   "op": "update",
   "ns": "###-production.products",
   "query": {
     "_id": ObjectId("4e32efa1e8a5c589d2815247") 
  },
   "updateobj": {
     "$set": {
       "popularity": 26068,
       "price.93.offer_count": 1,
       "price.93.min": "4.87",
       "solr_indexed": 0,
       "price.323.offer_count": 1,
       "price.323.min": "4.87" 
    } 
  },
   "nscanned": 1,
   "keyUpdates": 3,
   "millis": 172,
   "client": "78.46.151.34",
   "user": "" 
}
 
{
   "ts": ISODate("2012-05-01T11: 18: 49.664Z"),
   "op": "query",
   "ns": "###-production.products",
   "query": {
     "query": {
       "solr_indexed": 0 
    },
     "$explain": true 
  },
   "nscanned": 28173,
   "nreturned": 1,
   "responseLength": 489,
   "millis": 1366,
   "client": "127.0.0.1",
   "user": "" 
}
 
{
   "ts": ISODate("2012-05-01T11: 18: 05.201Z"),
   "op": "command",
   "ns": "###-production.$cmd",
   "command": {
     "count": "products",
     "query": {
       "solr_indexed": 0 
    } 
  },
   "ntoreturn": 1,
   "responseLength": 48,
   "millis": 935,
   "client": "46.4.119.174",
   "user": "" 
}

both _id and solr_indexed have their own indexes

as you see, it's mostly updates, and they're using only _id in the query, so the only thing that could make them slow is index rebuilding; but sometimes there are even findOne({_id: ObjectId('###')}) queries which time out even after 30 seconds, which in my opinion is unacceptable for a decent DBMS

example:

[01-May-2012 08:18:05] PHP Fatal error:  Uncaught exception 'MongoCursorTimeoutException' with message 'cursor timed out (timeout: 30000, time left: 0:0, status: 0)' in /var/www/html/###/includes/db/mongomongo.php:163
Stack trace:
#0 /var/www/html/###/includes/db/mongomongo.php(163): MongoCollection->findOne(Array, Array)
 
where the query looks like this:
{
	_id: ObjectId('4e32eb22e8a5c589d27e979c')				
}

free -ltm results:

2012/05/01 11:05
             total       used       free     shared    buffers     cached
Mem:         24022       5704      18317          0         29       5193
Low:         24022       5704      18317
High:            0          0          0
-/+ buffers/cache:        481      23541
Swap:         2046         29       2017
Total:       26069       5734      20335
 
2012/05/01 11:06
             total       used       free     shared    buffers     cached
Mem:         24022       5708      18313          0         29       5193
Low:         24022       5708      18313
High:            0          0          0
-/+ buffers/cache:        485      23537
Swap:         2046         29       2017
Total:       26069       5738      20331
 
2012/05/01 11:07
             total       used       free     shared    buffers     cached
Mem:         24022       5706      18315          0         29       5194
Low:         24022       5706      18315
High:            0          0          0
-/+ buffers/cache:        482      23539
Swap:         2046         29       2017
Total:       26069       5736      20333
 
2012/05/01 11:11
             total       used       free     shared    buffers     cached
Mem:         24022       5740      18281          0         30       5221
Low:         24022       5740      18281
High:            0          0          0
-/+ buffers/cache:        488      23533
Swap:         2046         29       2017
Total:       26069       5770      20299
 
2012/05/01 11:44
             total       used       free     shared    buffers     cached
Mem:         24022       5922      18100          0         33       5379
Low:         24022       5922      18100
High:            0          0          0
-/+ buffers/cache:        509      23513
Swap:         2046         29       2017
Total:       26069       5951      20117
 
# after running a process which reads all objects in the collection, and updates a few:
 
2012/05/01 12:43
             total       used       free     shared    buffers     cached
Mem:         24022       9291      14731          0         38       8714
Low:         24022       9291      14731
High:            0          0          0
-/+ buffers/cache:        537      23484
Swap:         2046         29       2017
Total:       26069       9320      16748
 
# after running the same process again, but this time updating most of them:
 
2012/05/01 13:2
             total       used       free     shared    buffers     cached
Mem:         24022       9652      14369          0         42       8912
Low:         24022       9652      14369
High:            0          0          0
-/+ buffers/cache:        698      23323
Swap:         2046         21       2025
Total:       26069       9674      16395

Comment by Scott Hernandez (Inactive) [ 29/Mar/12 ]

MMS shows that you are running very hot on the CPU. Do you have any slow queries which aren't using indexes? I would guess you need to tune your indexes for your queries since your CPU usage is so high. Can you enable database profiling to look for slow queries? http://www.mongodb.org/display/DOCS/Database+Profiler
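
For reference, a minimal sketch of turning on the profiler for slow operations (the 100 ms threshold is just an example):

// profile operations slower than 100 ms; they are written to the system.profile collection
db.setProfilingLevel(1, 100)

// inspect the most recent slow operations
db.system.profile.find().sort({ ts: -1 }).limit(5).pretty()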

When you get a timeout on findOne(...) can you do a find(...).explain() and log the results? It will show more information about what is going on at that moment.
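
For example, a sketch of re-running a timed-out _id lookup with explain (collection name and ObjectId are taken from the examples elsewhere in this ticket):

// re-run the same lookup as a find() so the query plan can be logged
db.products.find({ _id: ObjectId("4e32eb22e8a5c589d27e979c") }).explain()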

Can you tell me a bit more about the memory usage by running free -ltm periodically?

Comment by Nemanja Dubravac [ 29/Mar/12 ]

is this the maximum MongoDB can do on such a machine?

Comment by Nemanja Dubravac [ 26/Mar/12 ]

I have now attached the MMS stats from when I start (around 18:04) the rest of the scripts that query/update the collection and run in parallel

Comment by Nemanja Dubravac [ 26/Mar/12 ]

I've enabled the monitoring now.
can you see the data on your own, or should I attach screenshots?

Comment by Scott Hernandez (Inactive) [ 26/Mar/12 ]

Can you enable munin on your monitored hosts (MMS)? http://mms.10gen.com/help/install.html#hardware-monitoring-with-munin-node
