[SERVER-3236] High CPU wait load on capped collection reaching its limit Created: 09/Jun/11  Updated: 12/Jul/16  Resolved: 02/Sep/11

Status: Closed
Project: Core Server
Component/s: Performance
Affects Version/s: 1.8.1
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: Michael Korbakov Assignee: Scott Hernandez (Inactive)
Resolution: Done Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux ip-10-164-19-105 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:34:28 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

EC2 m2.2xlarge instance


Operating System: Linux
Participants:

 Description   

I'm seeing very strange behavior from Mongo on our Nimble server. We use MongoDB to run the cache for our product. The cache consists of a single capped collection with a few indexed fields. Here are its stats:

db> db.cache.stats()
{
    "ns" : "nimble_cache.cache",
    "count" : 6795240,
    "size" : 29891215048,
    "avgObjSize" : 4398.8461111013,
    "storageSize" : 30000000304,
    "numExtents" : 15,
    "nindexes" : 3,
    "lastExtentSize" : 256,
    "paddingFactor" : 1,
    "flags" : 0,
    "totalIndexSize" : 3472837440,
    "indexSizes" : {
        "cache_id_-1_created_-1" : 561824704,
        "depends_on_1_deadline_-1" : 2351080384,
        "cache_id_1_resource_1_created_-1" : 559932352
    },
    "capped" : 1,
    "max" : 2147483647,
    "ok" : 1
}
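
For reference, a capped collection of this shape could be created roughly like this (a sketch only; the size comes from storageSize above and the field names are inferred from the index names, not copied from our setup scripts):

// Sketch: recreating a capped collection of this shape (names inferred from the stats above)
db.createCollection("cache", { capped: true, size: 30000000000, max: 2147483647 })
db.cache.ensureIndex({ cache_id: -1, created: -1 })
db.cache.ensureIndex({ depends_on: 1, deadline: -1 })
db.cache.ensureIndex({ cache_id: 1, resource: 1, created: -1 })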

It runs on an m2.2xlarge EC2 instance that has 34 GB of memory. Here are the memory stats of the mongo instance:

> db.serverStatus().mem

{ "bits" : 64, "resident" : 6300, "virtual" : 34749, "supported" : true, "mapped" : 32960 }

The load on the server isn't really high (the lock percentage is terrible, however):
[root@ip-10-164-1-195 ~]# /opt/mongo/bin/mongostat
connected to: 127.0.0.1
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time
93 368 89 0 0 86 0 32.2g 33.9g 6.2g 7 39.8 0 5|7 5|8 444k 684k 151 22:56:44
66 302 85 0 0 90 0 32.2g 33.9g 6.2g 3 77.4 0 0|0 0|0 325k 504k 151 22:56:45
47 153 43 0 0 44 0 32.2g 33.9g 6.2g 4 77.1 0 0|0 0|0 327k 386k 151 22:56:46
67 256 70 0 0 66 0 32.2g 33.9g 6.2g 4 35.9 0 6|8 6|9 329k 437k 151 22:56:47
103 384 104 0 0 106 0 32.2g 33.9g 6.2g 5 60.9 0 6|6 6|7 449k 755k 151 22:56:48
56 207 71 0 0 75 0 32.2g 33.9g 6.2g 5 53.3 0 5|6 5|7 390k 623k 151 22:56:49
47 201 40 0 0 39 0 32.2g 33.9g 6.2g 4 78.1 0 4|4 4|5 348k 473k 151 22:56:50
99 381 117 0 0 121 0 32.2g 33.9g 6.2g 5 27.6 0 0|0 0|0 445k 766k 151 22:56:51
78 308 89 0 0 90 0 32.2g 33.9g 6.21g 5 39.3 0 0|0 0|0 461k 657k 151 22:56:52
115 374 122 0 0 120 0 32.2g 33.9g 6.21g 8 4.7 0 3|7 3|8 484k 647k 151 22:56:54

All updates are on non-indexed fields. All queries are using indexes.

The problem I have is very high disk IO activity, which is what causes the lock levels above. Here's the output of iostat:

[root@ip-10-164-1-195 ~]# iostat -xdm /dev/sdk 2
Linux 2.6.21.7-2.fc8xen (ip-10-164-1-195) 06/07/11

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdk 0.05 259.76 83.07 407.36 2.52 5.83 34.86 43.29 88.28 1.63 79.93
sdk 0.00 0.00 6.00 437.00 0.08 1.74 8.42 105.36 260.29 2.26 99.90
sdk 0.00 0.00 6.00 741.00 0.06 3.03 8.47 39.87 67.59 1.32 98.95
sdk 0.00 4.00 25.00 2519.00 0.17 10.07 8.24 80.18 31.24 0.39 98.30
sdk 0.00 25.00 14.00 1530.00 0.14 6.33 8.59 23.92 15.76 0.64 98.10
sdk 0.00 2.50 6.50 422.50 0.07 1.73 8.57 55.19 96.98 2.32 99.60
sdk 0.00 0.50 11.50 135.00 0.11 0.54 9.17 44.41 363.53 6.82 99.95
sdk 0.00 0.00 8.00 104.00 0.08 0.41 8.89 108.89 617.02 8.93 100.05
sdk 0.00 0.00 14.50 270.50 0.13 1.08 8.65 69.87 390.25 3.51 100.00
sdk 0.00 0.00 6.00 154.00 0.05 0.61 8.45 51.68 305.67 6.25 100.05
sdk 0.00 0.00 9.50 204.50 0.07 0.86 8.93 24.03 139.49 4.67 99.90

I have tried disabling journaling and increasing the syncdelay parameter (it's now set to 300 seconds), with no positive results so far. One thing I did notice is that after a mongod restart everything works pretty well for about 10-15 minutes. After that, disk IO starts rising and within a few minutes device utilization reaches 100%.
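
For reference, syncdelay was raised via the mongod startup options; roughly something like this (the dbpath is a placeholder, not our real config, and in 1.8.x journaling is enabled only when --journal is passed):

# Sketch of the startup flags involved (paths are placeholders)
mongod --dbpath /data/db --syncdelay 300 --journal    # drop --journal to run without journaling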

I'm really out of ideas on how to deal with this.



 Comments   
Comment by Eliot Horowitz (Inactive) [ 02/Sep/11 ]

Thanks

Comment by Michael Korbakov [ 02/Sep/11 ]

The problem turned out to be the EBS snapshots that we're using for backups. The cache collection was under heavy updates, and snapshotting put a heavy IO load on it. We solved the problem by reducing the frequency of the snapshots.

Comment by Eliot Horowitz (Inactive) [ 30/Aug/11 ]

Any update from your side?

Comment by Michael Korbakov [ 10/Jun/11 ]

No, there are no other collections in this database.
There's a single kind of update running on this collection: the query part uses an index, and the update part is a $set on a non-indexed field.
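
To illustrate the shape of these updates (the non-indexed "payload" field and the values are hypothetical, not taken from our code):

// Illustration only: the query part hits the cache_id_-1_created_-1 index,
// the update part $sets a hypothetical non-indexed "payload" field.
db.cache.update(
    { cache_id: 12345, created: { $lte: new Date() } },
    { $set: { payload: "..." } }
)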

We already tried profiling. Three different kinds of operations run on this collection: inserting a new cache record, invalidating a cache record, and fetching a cache record. Operations of each type were reported as "slow" by the profiler at different times.

One thing that I forgot to mention in the initial problem description: we're running backups using EBS snapshots. We don't stop the Mongo server; we take the snapshots "on the fly", relying on journaling for recovery. Maybe these backups contribute to the problem.
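
(For what it's worth, the usual pattern for a consistent snapshot without stopping the server is to flush and lock writes first; a rough sketch of that pattern, not our actual backup script:)

// Sketch only: quiescing writes around an EBS snapshot
var admin = db.getSiblingDB("admin");
admin.runCommand({ fsync: 1, lock: 1 });   // flush dirty data and block writes
// ... take the EBS snapshot of the data volume here ...
admin.$cmd.sys.unlock.findOne();           // release the write lock afterwards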

Unfortunately I can't get actual mongostat/iostat numbers right now; we're experimenting on our own trying to solve this issue, and the current configuration isn't the one described above. We'll go back to the original configuration with the capped collection after the weekend, and I'll attach the stats then.

Comment by Scott Hernandez (Inactive) [ 09/Jun/11 ]

Aside from the capped collection, are there other collections on the same server? What updates are you doing?

Can you turn on database profiling and report some of the slowest operations that you see? http://www.mongodb.org/display/DOCS/Database+Profiler
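
Something along these lines, for example (the 100 ms threshold is just an illustrative value):

// Enable the profiler for slow operations and inspect the worst offenders
db.setProfilingLevel(1, 100)                               // log operations slower than 100ms
db.system.profile.find().sort({ millis: -1 }).limit(10)    // the slowest recorded operations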

Also, can you provide a few minutes of mongostat/iostat numbers (as an attachment)?
