[SERVER-27393] Balancer taking 100% CPU due to large number of dropped sharded collections Created: 13/Dec/16 Updated: 05/Apr/17 Resolved: 21/Dec/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.4.0 |
| Fix Version/s: | 3.4.2, 3.5.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Isaac Cruz | Assignee: | Nathan Myers |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Backport Requested: |
v3.4
|
||||||||||||||||
| Sprint: | Sharding 2017-01-02 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Description |
|
Balancer started to take 100% CPU after upgrading to 3.4.0 from 3.2.9 and enabling the balancer. This is a DB with 4 shards (rs1, rs2, rs3 and rs4), but before upgrading we removed one shard (rs5), waiting until the drain completed. In the log I can see warnings like these for several collections:
Those collections are created and dropped after some days. And indeed those collections were dropped and are not shown in "db.getCollectionNames()". I investigated a bit and found those collections in config DB:
And there are locks for many collections related to the removed shard (rs5):
Not only there are locks for dropped collections but also for existant collections. Our guessing is that this is causing Balancer to continuously loop over all collections, and thus causing 100% CPU, but we are not sure how to work around. |
| Comments |
| Comment by Githook User [ 22/Dec/16 ] | ||||||
|
Author: {u'username': u'nathan-myers-mongo', u'name': u'Nathan Myers', u'email': u'nathan.myers@10gen.com'}Message: | ||||||
| Comment by Githook User [ 21/Dec/16 ] | ||||||
|
Author: {u'username': u'nathan-myers-mongo', u'name': u'Nathan Myers', u'email': u'nathan.myers@10gen.com'}Message: | ||||||
| Comment by Isaac Cruz [ 14/Dec/16 ] | ||||||
|
Just a note: with just one "dropped" collection in config.collections, the balancer keeps retrying with a pause of 10 seconds. As soon as I removed all entries with dropped=true, there are no more logs of Balancer in cfg server, and CPU usage reduces to 1%. | ||||||
| Comment by Kaloian Manassiev [ 13/Dec/16 ] | ||||||
|
Summary of the problem: For customers which frequently drop and recreate collections, this causes load on the server due to the need to issue a chunks query for each collection. We should make the balancer skip dropped namespaces. | ||||||
| Comment by Kaloian Manassiev [ 13/Dec/16 ] | ||||||
|
Thanks icruz. Please watch this ticket for updates on when the fix becomes available. Best regards, | ||||||
| Comment by Isaac Cruz [ 13/Dec/16 ] | ||||||
|
Hi Kal, thanks for your help, after removing dropped collections CPU load went back to normal. There were only 47 dropped collections. but cfg server is in a t2.micro from Amazon so it has very little CPU available. I'm attaching the log from config server, only for 1 hour since it's essentially repeating. | ||||||
| Comment by Kaloian Manassiev [ 13/Dec/16 ] | ||||||
|
Hi icruz, Thank you for the report and the investigation. Just like you noticed, it looks like the balancer code is not ignoring the deleted collections (those with dropped:true) and when we try to treat them as sharded we get an error, because there are no chunks for them. This error is benign, but since you most likely have a lot of these dropped collections it is causing a load on your server because of the number of queries the config server has to make. We are going to fix this by skipping dropped collections. In the mean time, you can work around this problem by manually dropping all collection entries with dropped:true using this command:
Do you mind letting us know how many dropped collections do you have just so we can write a realistic test for this problem (I created a database with 1000 dropped collection and was unable to get such a high CPU utilization):
About your question for the locks - these locks have state:0 which means "unlocked". These records never get cleaned up and that's why you are seeing them. Finally, would it be possible that you attach the log from the config server primary so we can have a look at all the errors which you are getting? Thank you in advance. Best regards, |