|
I see you are running 2.2.0-rc0; can you upgrade to 2.2.0 and see if the problem still occurs? A change was made to heap memory allocation in RC2 which may fix the cause of the high CPU you have seen.
|
|
You might want to try lowering the syncdelay.
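For reference, a minimal sketch of how a lower syncdelay could be tried from the legacy PHP driver used elsewhere in this issue, assuming the parameter is settable at runtime on this build (otherwise it can be set with --syncdelay when starting mongod); the host and the value 30 are only examples:

<?php
// Sketch: lower syncdelay at runtime via setParameter (assumption: this build
// accepts 'syncdelay' here; if not, restart mongod with --syncdelay <seconds>).
$mongo = new Mongo('mongodb://localhost:27017');   // placeholder host
$admin = $mongo->selectDB('admin');

// Example value only: flush dirty pages every 30 seconds instead of the 60s default.
$result = $admin->command(array('setParameter' => 1, 'syncdelay' => 30));
var_dump($result);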
|
|
there was another 30s timeout today at 10:01:48, and i saw the background flush you mentioned, so it's most probably the HDD writes causing these timeouts
is there a way to fix it, other than getting better drives?
|
|
another timeout after 30 seconds happened today, 26 June 2012, at 14:01:31 UTC
this was the query:
{_id: {$in: [ #2 mongo object IDs# ]}}
|
and these are the current OPs at the time:
array (
  0 => array ('opid' => 1116100861, 'active' => false, 'waitingForLock' => false, 'op' => 'query', 'ns' => '', 'query' => array (), 'client' => '78.##.###.##:42937', 'desc' => 'conn', 'threadId' => '0x7f84fc1fc700', 'connectionId' => 1154091, 'numYields' => 0),
  1 => array ('opid' => 1116100863, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => '0605279114863'), 'client' => '78.##.###.##:42932', 'desc' => 'conn', 'threadId' => '0x7f88104ff700', 'connectionId' => 1154088, 'numYields' => 0),
  2 => array ('opid' => 1116100851, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => '9781400106639'), 'client' => '176.#.###.##:39866', 'desc' => 'conn', 'threadId' => '0x7f84ed5d5700', 'connectionId' => 1154071, 'numYields' => 0),
  3 => array ('opid' => 1116100837, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('supplier_s' => 'pce', 'merged_manufacturer_id' => '629040'), 'client' => '176.#.###.##:39853', 'desc' => 'conn', 'threadId' => '0x7f84ee6e6700', 'connectionId' => 1154070, 'numYields' => 0),
  4 => array ('opid' => 1116100811, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('supplier_s' => 'graupner', 'merged_manufacturer_id' => '265318'), 'client' => '176.#.###.##:39835', 'desc' => 'conn', 'threadId' => '0x7f84eeaea700', 'connectionId' => 1154063, 'numYields' => 0),
  5 => array ('opid' => 1116100859, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => '4260087123054'), 'client' => '176.#.###.##:39829', 'desc' => 'conn', 'threadId' => '0x7f84ee3e3700', 'connectionId' => 1154062, 'numYields' => 0),
  6 => array ('opid' => 1116100858, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => '5992911416627'), 'client' => '78.##.###.##:42929', 'desc' => 'conn', 'threadId' => '0x7f8ab81cd700', 'connectionId' => 1154085, 'numYields' => 0),
  7 => array ('opid' => 1116100848, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => '4260060271093'), 'client' => '176.#.###.##:39869', 'desc' => 'conn', 'threadId' => '0x7f84529f2700', 'connectionId' => 1154075, 'numYields' => 0),
  8 => array ('opid' => 1116100854, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => '4260120086117'), 'client' => '78.##.###.##:42873', 'desc' => 'conn', 'threadId' => '0x7f84ec2c2700', 'connectionId' => 1154079, 'numYields' => 0),
  9 => array ('opid' => 1116100835, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => '9780007169214'), 'client' => '78.##.###.##:42924', 'desc' => 'conn', 'threadId' => '0x7f8ab3afa700', 'connectionId' => 1154081, 'numYields' => 0),
  10 => array ('opid' => 1116100853, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('supplier_s' => 'digitus', 'merged_manufacturer_id' => 'dn-19tray-1-1000-sw'), 'client' => '176.#.###.##:39820', 'desc' => 'conn', 'threadId' => '0x7f84525ee700', 'connectionId' => 1154058, 'numYields' => 0),
  11 => array ('opid' => 1116100867, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => '9780553822236'), 'client' => '78.##.###.##:42934', 'desc' => 'conn', 'threadId' => '0x7f84efdfd700', 'connectionId' => 1154092, 'numYields' => 0),
  12 => array ('opid' => 1116100871, 'active' => false, 'waitingForLock' => false, 'op' => 'query', 'ns' => '', 'query' => array (), 'client' => '176.#.###.##:39876', 'desc' => 'conn', 'threadId' => '0x7f8ab38f8700', 'connectionId' => 1154094, 'numYields' => 0),
  13 => array ('opid' => 1116100844, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => '9784007002120'), 'client' => '78.##.###.##:42935', 'desc' => 'conn', 'threadId' => '0x7f84fc3fe700', 'connectionId' => 1154090, 'numYields' => 0),
  14 => array ('opid' => 1116100873, 'active' => false, 'lockType' => 'read', 'waitingForLock' => false, 'op' => 'query', 'ns' => '', 'query' => array ('merged_ean' => '9783802404016'), 'client' => '78.##.###.##:42926', 'desc' => 'conn', 'threadId' => '0x7f88101fc700', 'connectionId' => 1154082, 'numYields' => 0),
  15 => array ('opid' => 1116100856, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('supplier_s' => 'hygiene-shop.eu', 'merged_manufacturer_id' => 'in0000155'), 'client' => '78.##.###.##:42927', 'desc' => 'conn', 'threadId' => '0x7f84ee5e5700', 'connectionId' => 1154084, 'numYields' => 0),
  16 => array ('opid' => 1116100865, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => '4052278489908'), 'client' => '78.##.###.##:42928', 'desc' => 'conn', 'threadId' => '0x7f8ab21fc700', 'connectionId' => 1154086, 'numYields' => 0),
  17 => array ('opid' => 1116100855, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => '4008224528030'), 'client' => '78.##.###.##:42930', 'desc' => 'conn', 'threadId' => '0x7f8ab3efe700', 'connectionId' => 1154093, 'numYields' => 0),
  18 => array ('opid' => 1116100850, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => array ('$in' => array (0 => '3425160177862', 1 => '3425160265736'))), 'client' => '176.#.###.##:39808', 'desc' => 'conn', 'threadId' => '0x7f83d1ce4700', 'connectionId' => 1154054, 'numYields' => 0),
  19 => array ('opid' => 1116100852, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => '5052836920378'), 'client' => '78.##.###.##:42806', 'desc' => 'conn', 'threadId' => '0x7f84ec7c7700', 'connectionId' => 1153967, 'numYields' => 0),
  20 => array ('opid' => 1116100819, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => '85126589615'), 'client' => '176.#.###.##:39868', 'desc' => 'conn', 'threadId' => '0x7f84ef2f2700', 'connectionId' => 1154073, 'numYields' => 0),
)
|
|
|
there are periods where the mms-agent starts using 100% CPU (it's running on another server) and then stops logging mongostat/iostat (sometimes only iostat) for a few minutes, which really makes it hard to analyze timeouts..
it doesn't output any errors in its log file, just restarts itself randomly after these periods of high CPU usage..
i guess i have to file a separate bug for this 
for example, there was a timeout after 2 seconds at Jun 25 2012 11:47:53 UTC while performing find(
{ _id: { $in: [ #array of 83 IDs# ] } }
)
this query asks for only a small subset of fields, so the returned amount of data shouldn't be too big:
- Mongo ObjectId
- array with 1-2 possible int values
- string up to 50 chars
- string up to 100 chars
- small multidimensional array with a few ints and strings
- small multidimensional array with a few strings
- int
- small array with a few ints
by small, i mean an array with 3-5 keys
this is the currentOp output:
array (
  0 => array ('opid' => 143077327, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => array ('$in' => array (0 => '3243480014830'))), 'client' => '176.#.###.##:52432', 'desc' => 'conn', 'threadId' => '0x7f84ed6d6700', 'connectionId' => 942255, 'numYields' => 0),
  1 => array ('opid' => 143077325, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('supplier_s' => 'sdu', 'merged_manufacturer_id' => '7711804'), 'client' => '78.##.###.##:58195', 'desc' => 'conn', 'threadId' => '0x7f8ab3afa700', 'connectionId' => 942203, 'numYields' => 0),
  2 => array ('opid' => 143077328, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('supplier_s' => 'ditech', 'merged_manufacturer_id' => '0pix'), 'client' => '176.#.###.##:52435', 'desc' => 'conn', 'threadId' => '0x7f84ed5d5700', 'connectionId' => 942256, 'numYields' => 0),
  3 => array ('opid' => 143077330, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => array ('$in' => array (0 => '4260245283040'))), 'client' => '176.#.###.##:52441', 'desc' => 'conn', 'threadId' => '0x7f84efbfb700', 'connectionId' => 942258, 'numYields' => 0),
  5 => array ('opid' => 143077320, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => '4332163402527'), 'client' => '78.##.###.##:58204', 'desc' => 'conn', 'threadId' => '0x7f84ec3c3700', 'connectionId' => 942205, 'numYields' => 0),
  6 => array ('opid' => 143077326, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => array ('$in' => array (0 => '4016032171713'))), 'client' => '176.#.###.##:52426', 'desc' => 'conn', 'threadId' => '0x7f84edada700', 'connectionId' => 942253, 'numYields' => 0),
  7 => array ('opid' => 143077324, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => '4004182237823'), 'client' => '78.##.###.##:58115', 'desc' => 'conn', 'threadId' => '0x7f84ee3e3700', 'connectionId' => 942189, 'numYields' => 0),
  8 => array ('opid' => 143077331, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('$msg' => 'query not recording (too large)'), 'client' => '46.#.###.###:53181', 'desc' => 'conn', 'threadId' => '0x7f84ee5e5700', 'connectionId' => 942261, 'numYields' => 0),
)
|
a timeout with a query like this also occurred at 25 June 2012 12:01:29, while these were the current OPs:
array (
  0 => array ('opid' => 151462272, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => '4024488092346'), 'client' => '78.##.###.##:60242', 'desc' => 'conn', 'threadId' => '0x7f84ef7f7700', 'connectionId' => 944104, 'numYields' => 0),
  1 => array ('opid' => 151462265, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('supplier_s' => 'emsa', 'merged_manufacturer_id' => '8519100150'), 'client' => '78.##.###.##:60224', 'desc' => 'conn', 'threadId' => '0x7f84ef4f4700', 'connectionId' => 944098, 'numYields' => 0),
  2 => array ('opid' => 151462277, 'active' => false, 'waitingForLock' => false, 'op' => 'query', 'ns' => '', 'query' => array (), 'client' => '78.##.###.##:60140', 'desc' => 'conn', 'threadId' => '0x7f8ab3bfb700', 'connectionId' => 943979, 'numYields' => 0),
  3 => array ('opid' => 151462275, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('supplier_s' => 'ditech', 'merged_manufacturer_id' => '0pix'), 'client' => '176.#.###.##:56527', 'desc' => 'conn', 'threadId' => '0x7f8ab3efe700', 'connectionId' => 944121, 'numYields' => 0),
  4 => array ('opid' => 151462270, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('$msg' => 'query not recording (too large)'), 'client' => '46.#.###.###:54003', 'desc' => 'conn', 'threadId' => '0x7f84ee6e6700', 'connectionId' => 944116, 'numYields' => 0),
  5 => array ('opid' => 151462271, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => array ('$in' => array (0 => '3243480014830'))), 'client' => '176.#.###.##:56518', 'desc' => 'conn', 'threadId' => '0x7f84ec9c9700', 'connectionId' => 944118, 'numYields' => 0),
  6 => array ('opid' => 151462260, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('supplier_s' => 'dr. winkler', 'merged_manufacturer_id' => '560'), 'client' => '78.##.###.##:60196', 'desc' => 'conn', 'threadId' => '0x7f8ab38f8700', 'connectionId' => 944091, 'numYields' => 0),
  7 => array ('opid' => 151462261, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('supplier_s' => 'römer', 'merged_manufacturer_id' => '2000002815'), 'client' => '78.##.###.##:60161', 'desc' => 'conn', 'threadId' => '0x7f84ecece700', 'connectionId' => 944000, 'numYields' => 0),
  8 => array ('opid' => 151462276, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('supplier_s' => 'walkntalkonline', 'merged_manufacturer_id' => 'samsungi9100galaxys2'), 'client' => '78.##.###.##:60166', 'desc' => 'conn', 'threadId' => '0x7f84ec7c7700', 'connectionId' => 944001, 'numYields' => 0),
  9 => array ('opid' => 151462274, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('supplier_s' => 'metaltex', 'merged_manufacturer_id' => '252814010'), 'client' => '78.##.###.##:60227', 'desc' => 'conn', 'threadId' => '0x7f84ec1c1700', 'connectionId' => 944099, 'numYields' => 0),
  10 => array ('opid' => 151462264, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('supplier_s' => 'selfsat', 'merged_manufacturer_id' => 'hh-90'), 'client' => '78.##.###.##:60205', 'desc' => 'conn', 'threadId' => '0x7f88101fc700', 'connectionId' => 944093, 'numYields' => 0),
  11 => array ('opid' => 151462266, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => '0886489256639'), 'client' => '78.##.###.##:60177', 'desc' => 'conn', 'threadId' => '0x7f83d24ec700', 'connectionId' => 944004, 'numYields' => 0),
  12 => array ('opid' => 151462262, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('supplier_s' => 'heyda', 'merged_manufacturer_id' => '204716252'), 'client' => '78.##.###.##:60229', 'desc' => 'conn', 'threadId' => '0x7f84efafa700', 'connectionId' => 944100, 'numYields' => 0),
  13 => array ('opid' => 151462268, 'active' => true, 'lockType' => 'read', 'waitingForLock' => false, 'secs_running' => 0, 'op' => 'query', 'ns' => '###-production.###_products', 'query' => array ('merged_ean' => '5053072968506'), 'client' => '78.##.###.##:60156', 'desc' => 'conn', 'threadId' => '0x7f84edada700', 'connectionId' => 943994, 'numYields' => 0),
)
|
|
|
i have added the currentOp query on timeout exception, so now we wait.. 
we can leave the issue public for now
|
|
I've attached an MMS capture of the relevant minutes and it looks like a disk event related to background flushing and page faults which slowed things down for those queries. The currentOp might help narrow things down a bit once we get that.
There is a similar and worse event at ~8:47; do you get slowdowns/timeouts then too?
It is also worth noting that the CPU I/O wait graphs support this, showing higher I/O wait at the same time.
Please let us know if you want to make this issue private as we are now asking for real queries from your system (the currentOp info).
|
|
Looking at MMS now; when you get these timeouts, can you dump the output of db.currentOp()? This will help us see what else is going on at that moment.
If you want to put this in your application, all you need to do is a findOne on the "$cmd.sys.inprog" collection. If you tell me what language you are using I can help you craft that query.
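For a PHP application (the stack traces later in this issue show the legacy PHP driver), a minimal sketch of that query could look like the following; the host, database and collection names are placeholders, and the ObjectId is just an example value taken from this thread:

<?php
// Sketch: on a timeout, capture the server's in-progress operations by doing a
// findOne on the virtual '$cmd.sys.inprog' collection (the same data that
// db.currentOp() returns in the shell).
$mongo    = new Mongo('mongodb://localhost:27017');   // placeholder host
$db       = $mongo->selectDB('production');           // placeholder database
$products = $db->selectCollection('products');        // placeholder collection

try {
    $doc = $products->findOne(array('_id' => new MongoId('4e32eb22e8a5c589d27e979c')));
} catch (MongoCursorTimeoutException $e) {
    // Single quotes matter so PHP does not try to interpolate $cmd as a variable.
    $inprog = $db->selectCollection('$cmd.sys.inprog')->findOne();
    error_log('query timed out; currentOp: ' . var_export($inprog, true));
    throw $e;
}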
|
|
we still get occasional timeouts
for example, at Jun 18 2012 08:17:58 UTC:
{_id: {$in: [ #array of 86 IDs# ] }}
|
timed out after 4 attempts, with timeouts of 1, 2, 3 and 4 seconds, respectively
and this one happened at Jun 18 2012 07:47:47 UTC:
{_id: ObjectId('#random ID#')}
|
timed out after 30 seconds
|
|
i have just upgraded the server to 2.0.6 and compacted the collection 30 minutes ago
|
|
i just finished optimizing one of our most often executed queries, so that we no longer use a low-cardinality field ('removed') in the query
these queries were never logged in the profiler, because they never took more than 100 ms, but since there are so many of them, they add up
before, we used these:
{ean: '#random ean string#', removed: {$ne: 1}}
{brand_s: '#random brand string#', manufacturer_id_s: '#random mpn string#', removed: {$ne: 1}}
since the 'removed' field can only have the values 0, 1, 2, 3, 4, i realized it would be better to move that check to the application
so now we're querying for
{ean: '#random ean number#'}
and
{brand_s: '#random brand string#', manufacturer_id_s: '#random mpn string#'}
only
since this change (half an hour ago) the number of queries/s has doubled from 3000 to 6000, but we're still seeing very high CPU usage (~700%)
i still have to wait to see if a random access by _id occasionally times out as it did before
just for reference in MMS, i'm talking about server01 the whole time
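For illustration, a minimal sketch of the change described above with the legacy PHP driver; the field names come from the comment, while the connection details, collection name and example EAN are placeholders:

<?php
// Sketch of the optimization above: drop the low-cardinality 'removed' field
// from the query and apply that check in the application instead.
$mongo    = new Mongo('mongodb://localhost:27017');   // placeholder host
$products = $mongo->selectDB('production')->selectCollection('products');
$ean      = '4024488092346';                          // example EAN from this thread

// Before: {ean: '...', removed: {$ne: 1}}
// $doc = $products->findOne(array('ean' => $ean, 'removed' => array('$ne' => 1)));

// After: query on the selective key only, then filter in PHP.
$doc = $products->findOne(array('ean' => $ean));
if ($doc !== null && isset($doc['removed']) && $doc['removed'] == 1) {
    $doc = null;   // treat it as not found, matching the old removed: {$ne: 1} check
}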
|
|
Is this still happening? Now that we have MMS+Munin being collected it would be helpful if you can point out some times (in UTC) when the slowdown is happening.
|
|
unfortunately, it doesn't change a thing for us.. :/
any other suggestions?
|
|
Apparently restarting mongod with interleave=all fixed the issue (although /proc/sys/vm/zone_reclaim_mode was 0). The behavior has not repeated itself in the last week.
|
|
almost 2 weeks w/o a reply..
could it be some CentOS specific thing? (NUMA is disabled, as mentioned above)
what else should i test?
|
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn set repl time
0 9019 5 0 0 2 0 22.2g 26.7g 16.8g 0 0.1 0 0|0 1|0 1m 342k 42 ###_replication M 10:56:23
0 9387 2 0 0 1 0 22.2g 26.7g 16.8g 0 0 0 0|0 7|0 1m 355k 40 ###_replication M 10:56:24
0 9310 2 0 0 2 0 22.2g 26.7g 16.8g 0 0 0 0|0 5|0 1m 376k 40 ###_replication M 10:56:25
0 9169 5 0 0 1 0 22.2g 26.7g 16.8g 1 0.1 0 0|0 8|0 1m 357k 41 ###_replication M 10:56:26
0 8882 5 0 0 2 0 22.2g 26.7g 16.8g 0 0.1 0 0|0 7|0 1m 332k 41 ###_replication M 10:56:27
0 9412 4 0 0 1 0 22.2g 26.7g 16.8g 0 0.1 0 0|0 4|0 1m 352k 39 ###_replication M 10:56:28
0 10526 2 0 0 2 0 22.2g 26.7g 16.8g 0 0 0 0|0 4|0 1m 414k 39 ###_replication M 10:56:29
0 9132 6 0 0 1 0 22.2g 26.7g 16.8g 0 0.1 0 0|0 8|0 1m 347k 38 ###_replication M 10:56:30
0 9375 4 0 0 3 0 22.2g 26.7g 16.8g 0 0.1 0 0|0 6|0 1m 354k 38 ###_replication M 10:56:31
0 10090 6 0 0 1 0 22.2g 26.7g 16.8g 0 0.1 0 0|0 1|0 1m 373k 38 ###_replication M 10:56:32
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn set repl time
0 10279 4 0 0 12 0 22.2g 26.7g 16.8g 0 0.1 0 0|0 2|0 1m 791k 38 ###_replication M 10:56:33
0 10563 5 0 0 1 0 22.2g 26.7g 16.8g 0 0.1 0 0|0 4|0 1m 393k 37 ###_replication M 10:56:34
0 10930 2 0 0 2 0 22.2g 26.7g 16.8g 0 0 0 0|0 5|0 1m 413k 38 ###_replication M 10:56:35
0 10285 4 0 0 1 0 22.2g 26.7g 16.8g 0 0 0 0|0 8|0 1m 400k 37 ###_replication M 10:56:36
0 10670 4 0 0 2 0 22.2g 26.7g 16.8g 0 0.1 0 0|0 2|0 1m 400k 37 ###_replication M 10:56:37
|
there are many other queries running on the server, as you can see by the query count, but none of them get reported in the profiling collection (they are all using indexes)
and this is the explain for the query; it was run at 10:56:30:
db.products.find({solr_indexed: 0}).explain()
|
{
    "cursor" : "BtreeCursor _solr_indexed_",
    "nscanned" : 1796,
    "nscannedObjects" : 1796,
    "n" : 1796,
    "millis" : 16,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "solr_indexed" : [ [ 0, 0 ] ]
    }
}
|
there is one slow query being reported in the profiling, but i think the speed is reasonable, since it's a big update:
{
    "ts": ISODate("2012-05-31T08:20:36.749Z"),
    "op": "update",
    "ns": "###-production.products",
    "query": {
        "_id": {
            "$in": {
                "0": ObjectId("4ed6143c8c40045aa7483fbf"),
                "1": ObjectId("4ed613e48c40045aa7483f42"),
                "2": ObjectId("4ed6140c8c40045aa7483f64"),
                "3": ObjectId("4e83dd228c40045aa747435b"),
                "4": ObjectId("4ed614098c40045aa7483f52"),
                "5": ObjectId("4ed6140a8c40045aa7483f53"),
                "6": ObjectId("4ed614658c40045aa7483fe2"),
                "7": ObjectId("4ed614658c40045aa7483fe4"),
                ... up to 1000 object IDs
            }
        }
    },
    "updateobj": {
        "$set": { "solr_indexed": 1 }
    },
    "keyUpdates": 1,
    "millis": 211,
    "client": "###",
    "user": ""
}
|
|
|
|
Dumitru, can you create a new issue with your stats (cpu/mongostat/iostat/etc)?
Nemanja, it looks like you are returning lots of documents and that takes a while. Can you post mongostat numbers when you run the explain again? Please note the start/end time.
|
|
I have a similar issue: high CPU usage (2200% across 24 cores) with load < 1 on 2.6.32-220.4.1.el6.x86_64.
Cpu(s): 5.7%us, 81.6%sy, 0.0%ni, 12.2%id, 0.0%wa, 0.0%hi, 0.5%si, 0.0%st
26483 mongo 20 0 51.4g 6.8g 6.2g S 2082.4 3.6 20833:13 mongod
It seems strange that 81% of all CPU is used by the system on behalf of mongod.
Another thing: I also make queries with "$in", in a replica set. However, when I rs.stepDown() the problem server and the other one becomes primary, this behavior stops, and for the same amount of queries I get less than 100% usage. Both servers are configured the same way and have similar hardware. This https://jira.mongodb.org/browse/SERVER-3497 is not the issue.
|
> db.products.find({solr_indexed: 0}).explain()
{
    "cursor" : "BtreeCursor _solr_indexed_",
    "nscanned" : 24804,
    "nscannedObjects" : 24804,
    "n" : 24804,
    "millis" : 399,
    "nYields" : 1,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "solr_indexed" : [ [ 0, 0 ] ]
    }
}
|
there are 893588 documents in total in this collection
i really don't think this is the query that's causing random slowdowns, because it's not run that often..
|
|
It looks like the query may be different (the results clearly are), can you please post the query along with the explain?
|
|
there is an index on this solr_indexed field
and it is using the index when i perform the same query manually:
{
    "cursor": "BtreeCursor _solr_indexed_",
    "nscanned": 5551,
    "nscannedObjects": 5551,
    "n": 5551,
    "scanAndOrder": true,
    "millis": 756,
    "nYields": 11,
    "nChunkSkips": 0,
    "isMultiKey": false,
    "indexOnly": false,
    "indexBounds": {
        "solr_indexed": [ [ 0, 0 ] ]
    },
    "allPlans": [
        {
            "cursor": "BtreeCursor _id_ reverse",
            "indexBounds": {
                "_id": [ [ { "$maxElement": 1 }, { "$minElement": 1 } ] ]
            }
        },
        {
            "cursor": "BtreeCursor _solr_indexed_",
            "indexBounds": {
                "solr_indexed": [ [ 0, 0 ] ]
            }
        },
        {
            "cursor": "BasicCursor",
            "indexBounds": [ ]
        }
    ],
    "oldPlan": {
        "cursor": "BtreeCursor _solr_indexed_",
        "indexBounds": {
            "solr_indexed": [ [ 0, 0 ] ]
        }
    }
}
|
|
|
This looks like it's not using an index:
{
    "ts": ISODate("2012-05-01T11:18:49.664Z"),
    "op": "query",
    "ns": "###-production.products",
    "query": {
        "query": { "solr_indexed": 0 },
        "$explain": true
    },
    "nscanned": 28173,
    "nreturned": 1,
    "responseLength": 489,
    "millis": 1366,
    "client": "127.0.0.1",
    "user": ""
}
|
|
|
No, as far as I know, all our queries are using indexes..
I started logging slow queries, and here are a few examples:
{
    "ts": ISODate("2012-05-01T10:29:09.885Z"),
    "op": "update",
    "ns": "###-production.products",
    "query": {
        "_id": {
            "$in": {
                "0": ObjectId("4e32e402e8a5c589d27a0827"),
                "1": ObjectId("4ea753cc8c40045aa7477522"),
                "2": ObjectId("4e32e406e8a5c589d27a0b10"),
                "3": ObjectId("4e32e873e8a5c589d27d0604"),
                "4": ObjectId("4ea753d78c40045aa7477570"),
                "5": ObjectId("4e32e42ce8a5c589d27a2616"),
                "6": ObjectId("4e32e42ce8a5c589d27a2621"),
                "7": ObjectId("4e32e42ce8a5c589d27a2651"),
                "8": ObjectId("4e32e42ce8a5c589d27a266c"),
                "9": ObjectId("4e32e42de8a5c589d27a26cf"),
                ...
                "998": ObjectId("4e32e477e8a5c589d27a58c3"),
                "999": ObjectId("4e32e5cde8a5c589d27b5e46")
            }
        }
    },
    "updateobj": {
        "$set": { "solr_indexed": 1 }
    },
    "millis": 8491,
    "client": "46.4.119.174",
    "user": ""
}

{
    "ts": ISODate("2012-05-01T09:20:38.418Z"),
    "op": "update",
    "ns": "###-production.products",
    "query": {
        "_id": ObjectId("4e32efa1e8a5c589d2815247")
    },
    "updateobj": {
        "$set": {
            "popularity": 26068,
            "price.93.offer_count": 1,
            "price.93.min": "4.87",
            "solr_indexed": 0,
            "price.323.offer_count": 1,
            "price.323.min": "4.87"
        }
    },
    "nscanned": 1,
    "keyUpdates": 3,
    "millis": 172,
    "client": "78.46.151.34",
    "user": ""
}

{
    "ts": ISODate("2012-05-01T11:18:49.664Z"),
    "op": "query",
    "ns": "###-production.products",
    "query": {
        "query": { "solr_indexed": 0 },
        "$explain": true
    },
    "nscanned": 28173,
    "nreturned": 1,
    "responseLength": 489,
    "millis": 1366,
    "client": "127.0.0.1",
    "user": ""
}

{
    "ts": ISODate("2012-05-01T11:18:05.201Z"),
    "op": "command",
    "ns": "###-production.$cmd",
    "command": {
        "count": "products",
        "query": { "solr_indexed": 0 }
    },
    "ntoreturn": 1,
    "responseLength": 48,
    "millis": 935,
    "client": "46.4.119.174",
    "user": ""
}
|
both _id and solr_indexed have their own indexes
as you can see, it's mostly updates, and they use only _id in the query, so the only thing that could make them slow is index rebuilding; but sometimes there are even findOne({_id: ObjectId('###')}) queries which time out even after 30 seconds, which is in my opinion unacceptable for a decent DBMS
example:
[01-May-2012 08:18:05] PHP Fatal error: Uncaught exception 'MongoCursorTimeoutException' with message 'cursor timed out (timeout: 30000, time left: 0:0, status: 0)' in /var/www/html/###/includes/db/mongomongo.php:163
Stack trace:
#0 /var/www/html/###/includes/db/mongomongo.php(163): MongoCollection->findOne(Array, Array)

where the query looks like this:

{ _id: ObjectId('4e32eb22e8a5c589d27e979c') }
|
free -ltm results:
2012/05/01 11:05
total used free shared buffers cached
Mem: 24022 5704 18317 0 29 5193
Low: 24022 5704 18317
High: 0 0 0
-/+ buffers/cache: 481 23541
Swap: 2046 29 2017
Total: 26069 5734 20335

2012/05/01 11:06
total used free shared buffers cached
Mem: 24022 5708 18313 0 29 5193
Low: 24022 5708 18313
High: 0 0 0
-/+ buffers/cache: 485 23537
Swap: 2046 29 2017
Total: 26069 5738 20331

2012/05/01 11:07
total used free shared buffers cached
Mem: 24022 5706 18315 0 29 5194
Low: 24022 5706 18315
High: 0 0 0
-/+ buffers/cache: 482 23539
Swap: 2046 29 2017
Total: 26069 5736 20333

2012/05/01 11:11
total used free shared buffers cached
Mem: 24022 5740 18281 0 30 5221
Low: 24022 5740 18281
High: 0 0 0
-/+ buffers/cache: 488 23533
Swap: 2046 29 2017
Total: 26069 5770 20299

2012/05/01 11:44
total used free shared buffers cached
Mem: 24022 5922 18100 0 33 5379
Low: 24022 5922 18100
High: 0 0 0
-/+ buffers/cache: 509 23513
Swap: 2046 29 2017
Total: 26069 5951 20117

# after running a process which reads all objects in the collection, and updates a few:

2012/05/01 12:43
total used free shared buffers cached
Mem: 24022 9291 14731 0 38 8714
Low: 24022 9291 14731
High: 0 0 0
-/+ buffers/cache: 537 23484
Swap: 2046 29 2017
Total: 26069 9320 16748

# after running the same process again, but this time updating most of them:

2012/05/01 13:2
total used free shared buffers cached
Mem: 24022 9652 14369 0 42 8912
Low: 24022 9652 14369
High: 0 0 0
-/+ buffers/cache: 698 23323
Swap: 2046 21 2025
Total: 26069 9674 16395
|
|
|
MMS shows that you are running very hot on the CPU. Do you have any slow queries which aren't using indexes? I would guess you need to tune your indexes for your queries since your CPU is so high. Can you enable database profiling to look for slow queries? http://www.mongodb.org/display/DOCS/Database+Profiler
When you get a timeout on findOne(...) can you do a find(...).explain() and log the results? It will show more information about what is going on at that moment.
Can you tell me a bit more about the memory usage by running free -ltm periodically?
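For completeness, a minimal sketch of enabling the profiler and reading it back from the legacy PHP driver; the host and database name below are placeholders, and the 100 ms threshold is just a commonly used value:

<?php
// Sketch: enable level-1 profiling for operations slower than 100 ms
// (equivalent to db.setProfilingLevel(1, 100) in the shell) and list the
// slowest recorded operations.
$mongo = new Mongo('mongodb://localhost:27017');   // placeholder host
$db    = $mongo->selectDB('production');           // placeholder database

$db->command(array('profile' => 1, 'slowms' => 100));

// Recent slow operations, slowest first.
$cursor = $db->selectCollection('system.profile')
             ->find()
             ->sort(array('millis' => -1))
             ->limit(10);
foreach ($cursor as $op) {
    print_r($op);
}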
|
|
is this the maximum MongoDB can do on such a machine?
|
|
i have now attached the MMS stats from when i started (around 18:04) the rest of the scripts that query/update the collection and run in parallel
|
|
i've enabled the monitoring now..
can you see the data on your own, or should i attach screenshots?
|
|
Can you enable munin on your monitored hosts (MMS)? http://mms.10gen.com/help/install.html#hardware-monitoring-with-munin-node
|