[SERVER-27393] Balancer taking 100% CPU due to large number of dropped sharded collections Created: 13/Dec/16  Updated: 05/Apr/17  Resolved: 21/Dec/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.4.0
Fix Version/s: 3.4.2, 3.5.2

Type: Bug Priority: Major - P3
Reporter: Isaac Cruz Assignee: Nathan Myers
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongod.log.bz2    
Issue Links:
Backports
Related
related to SERVER-27474 Eliminate "dropped" collections from ... Closed
related to SERVER-27475 mongos should request an update only ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.4
Sprint: Sharding 2017-01-02
Participants:

 Description   

The balancer started to take 100% CPU after upgrading from 3.2.9 to 3.4.0 and enabling the balancer. This is a DB with 4 shards (rs1, rs2, rs3 and rs4); before upgrading we removed one shard (rs5) and waited until the drain completed.

In the log I can see warnings like these for several collections:

2016-12-13T09:52:11.211+0000 W SHARDING [Balancer] Unable to enforce tag range policy for collection eplus.wifiCollection_20161008 :: caused by :: Location10181: not sharded:eplus.wifiCollection_20161008
2016-12-13T09:52:13.087+0000 W SHARDING [Balancer] Unable to enforce tag range policy for collection eplus.wifiCollection_20161009 :: caused by :: Location10181: not sharded:eplus.wifiCollection_20161009
2016-12-13T09:53:38.583+0000 W SHARDING [Balancer] Unable to balance collection eplus.wifiCollection_20161008 :: caused by :: Location10181: not sharded:eplus.wifiCollection_20161008
2016-12-13T09:53:40.360+0000 W SHARDING [Balancer] Unable to balance collection eplus.wifiCollection_20161009 :: caused by :: Location10181: not sharded:eplus.wifiCollection_20161009

Those collections are created and then dropped after a few days, and indeed they were dropped and no longer appear in "db.getCollectionNames()".

I investigated a bit and found those collections still listed in config.collections:

{ "_id" : "eplus.wifiCollection_20161008", "lastmodEpoch" : ObjectId("000000000000000000000000"), "lastmod" : ISODate("2016-10-18T04:00:13.108Z"), "dropped" : true }
{ "_id" : "eplus.wifiCollection_20161009", "lastmodEpoch" : ObjectId("000000000000000000000000"), "lastmod" : ISODate("2016-10-19T04:00:48.158Z"), "dropped" : true }

And config.locks contains entries for many collections related to the removed shard (rs5):

{ "_id" : "eplus.wifiCollection_20160908", "state" : 0, "ts" : ObjectId("5837ee01c839440f1e70d384"), "who" : "wifi-db-05a:27018:1475838481:-1701389523:conn104", "process" : "wifi-db-05a:27018:1475838481:-1701389523", "when" : ISODate("2016-11-25T07:53:37.235Z"), "why" : "migrating chunk [{ lineId: 8915926302292949940 }, { lineId: MaxKey }) in eplus.wifiCollection_20160908" }
{ "_id" : "eplus.wifiCollection_20160909", "state" : 0, "ts" : ObjectId("5837ee01c839440f1e70d38b"), "who" : "wifi-db-05a:27018:1475838481:-1701389523:conn104", "process" : "wifi-db-05a:27018:1475838481:-1701389523", "when" : ISODate("2016-11-25T07:53:37.296Z"), "why" : "migrating chunk [{ lineId: 8915926302292949940 }, { lineId: MaxKey }) in eplus.wifiCollection_20160909" }

There are locks not only for dropped collections but also for existing ones. Our guess is that this is causing the balancer to continuously loop over all collections, and thus the 100% CPU, but we are not sure how to work around it.



 Comments   
Comment by Githook User [ 22/Dec/16 ]

Author: Nathan Myers (nathan-myers-mongo) <nathan.myers@10gen.com>

Message: SERVER-27393 Balancer should skip dropped collections
Branch: v3.4
https://github.com/mongodb/mongo/commit/81e1e3cc42a592cdb9e3ca32a43eec2a3790a18f

Comment by Githook User [ 21/Dec/16 ]

Author: Nathan Myers (nathan-myers-mongo) <nathan.myers@10gen.com>

Message: SERVER-27393 Balancer should skip dropped collections
Branch: master
https://github.com/mongodb/mongo/commit/ba15d5ba13359ba0aedbb58ee10cfc919fd9e0f6

Comment by Isaac Cruz [ 14/Dec/16 ]

Just a note: with just one "dropped" collection in config.collections, the balancer keeps retrying with a 10-second pause. As soon as I removed all entries with dropped=true, there were no more Balancer log messages on the config server, and CPU usage dropped to 1%.

Comment by Kaloian Manassiev [ 13/Dec/16 ]

Summary of the problem:
The sharding balancer uses the entries in config.collections to decide which collections to balance. This is done without any predicate on the state of the collection, which means we also try to balance dropped collections.

For customers who frequently drop and recreate collections, this causes load on the server due to the need to issue a chunks query for each collection. We should make the balancer skip dropped namespaces.
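
In query terms, skipping dropped namespaces amounts to adding a predicate like the one below when enumerating collections; the shell query is just an illustration of the filter, since the actual change lives in the balancer's collection-selection code:

mongos> use config;
mongos> db.collections.find({dropped: {$ne: true}});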

Comment by Kaloian Manassiev [ 13/Dec/16 ]

Thanks icruz. Please watch this ticket for updates on when the fix becomes available.

Best regards,
-Kal.

Comment by Isaac Cruz [ 13/Dec/16 ]

Hi Kal,

Thanks for your help; after removing the dropped collections, CPU load went back to normal. There were only 47 dropped collections, but the config server runs on an Amazon t2.micro, so it has very little CPU available.

I'm attaching the log from the config server, covering only one hour since it essentially repeats.

Comment by Kaloian Manassiev [ 13/Dec/16 ]

Hi icruz,

Thank you for the report and the investigation.

As you noticed, the balancer code is not ignoring deleted collections (those with dropped:true), and when we try to treat them as sharded we get an error because there are no chunks for them. This error is benign, but since you most likely have a lot of these dropped collections, it is causing load on your server because of the number of queries the config server has to make.

We are going to fix this by making the balancer skip dropped collections. In the meantime, you can work around the problem by manually removing all collection entries with dropped:true using this command:

mongos> use config;
mongos> db.collections.remove({dropped:true});
WriteResult({ "nRemoved" : 1000 })

Do you mind letting us know how many dropped collections you have, just so we can write a realistic test for this problem? (I created a database with 1000 dropped collections and was unable to get such high CPU utilization.)

mongos> use config;
mongos> db.collections.count({dropped:true});
1000

About your question on the locks: these locks have state:0, which means "unlocked". These records never get cleaned up, and that is why you are seeing them.
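
If you want to double-check that all of the lock records you are seeing are indeed unlocked, a count along these lines should do it:

mongos> use config;
mongos> db.locks.count({state: 0});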

Finally, would it be possible for you to attach the log from the config server primary so we can have a look at all the errors you are getting?

Thank you in advance.

Best regards,
-Kal.
