[SERVER-14323] Balancer hits config servers hard even though there are no writes Created: 20/Jun/14  Updated: 20/Jul/16  Resolved: 20/Jul/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.6.3, 3.0.12, 3.2.8, 3.3.8
Fix Version/s: 3.3.9

Type: Bug Priority: Major - P3
Reporter: Jan Ježek Assignee: Kaloian Manassiev
Resolution: Done Votes: 2
Labels: balancer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongos.log.gz    
Issue Links:
Depends
depends on SERVER-22672 Move the sharding balancer to CSRS pr... Closed
Duplicate
is duplicated by SERVER-14810 Balancer shouldn't need to load the f... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

In my case, I have a database spread across 2 shards with slightly fewer than 30M documents in about 130k chunks. I have 6 mongos instances, each reading from the first config server at about 20 Mbit/s, which is 120 Mbit/s in total. Because of this, I have to control the balancer manually.

Participants:

 Description   

Every mongos instance starts a balancer round every few seconds. While looking for candidate chunks to move, it always reads all of the chunks for the given namespace (ns) from the config database, often only to find that the collection is in fact well balanced. This can cause significant network load on the config servers.



 Comments   
Comment by Kaloian Manassiev [ 20/Jul/16 ]

With the resolution of SERVER-22672, the sharding balancer runs only on the CSRS primary and uses a single cached version of the chunks with incremental updates.
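For illustration only, here is a minimal Python sketch of why a cached chunk set with incremental updates produces less read traffic than reloading every chunk each round. It is a toy model, not the server's implementation: `ConfigStore`, `ChunkCache`, `read_all`, `read_since` and the integer `version` field are all hypothetical names invented for this example, and the chunk counts are scaled down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: int
    shard: str
    version: int  # hypothetical stand-in for a chunk's modification version


class ConfigStore:
    """Toy stand-in for the chunk metadata kept on the config servers."""

    def __init__(self, num_chunks, shards):
        self.version = 0
        self.docs_sent = 0  # counts chunk documents returned to callers
        self.chunks = {}
        for i in range(num_chunks):
            self.version += 1
            self.chunks[i] = Chunk(i, shards[i % len(shards)], self.version)

    def move_chunk(self, chunk_id, to_shard):
        # A migration bumps the version of the chunk that moved.
        self.version += 1
        self.chunks[chunk_id] = Chunk(chunk_id, to_shard, self.version)

    def read_all(self):
        # Full reload: every chunk document crosses the network.
        result = list(self.chunks.values())
        self.docs_sent += len(result)
        return result

    def read_since(self, version):
        # Incremental read: only chunks changed after `version`.
        result = [c for c in self.chunks.values() if c.version > version]
        self.docs_sent += len(result)
        return result


class ChunkCache:
    """Keeps one cached copy of the chunks and applies only the changes."""

    def __init__(self):
        self.chunks = {}
        self.version = 0

    def refresh(self, store):
        for chunk in store.read_since(self.version):
            self.chunks[chunk.chunk_id] = chunk
            self.version = max(self.version, chunk.version)


if __name__ == "__main__":
    shards = ["shard0", "shard1"]
    rounds = 10

    full = ConfigStore(1000, shards)
    for _ in range(rounds):
        full.read_all()

    incremental = ConfigStore(1000, shards)
    cache = ChunkCache()
    for _ in range(rounds):
        cache.refresh(incremental)

    print("full reload:", full.docs_sent, "chunk documents read")
    print("incremental:", incremental.docs_sent, "chunk documents read")
```

In this model the full reload reads every chunk document on every round, while the cache pays the full cost once and afterwards reads only chunks whose version changed. With several readers each doing full reloads, the first figure is multiplied by the number of readers.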

Comment by Matthieu Rigal [ 18/Dec/15 ]

I can also confirm a very similar problem, with unnecessarily high network load!

Comment by Greg Studer [ 03/Jul/14 ]

Confirmed - with chunk diffing changes, the balancer chunk reload is more expensive than it needs to be.

Comment by Jan Ježek [ 23/Jun/14 ]

This is the query that causes the network load:
https://github.com/mongodb/mongo/blob/v2.6.3/src/mongo/s/balance.cpp#L280

Comment by Jan Ježek [ 23/Jun/14 ]

Sorry about the version confusion. I looked at the source code to confirm my assumptions, and that was in the v2.6.3 branch. In production we actually use the 2.4 version that comes with Debian.
I have spawned a new temporary mongos for the cluster with the -vvvvv parameter. The log is attached. It was run against a similar cluster in our staging environment, which is a bit smaller than production; nonetheless, the problem is visible there too.

Comment by Jan Ježek [ 23/Jun/14 ]

Log from a temporary mongos process run with -vvvvv

Comment by Thomas Rueckstiess [ 20/Jun/14 ]

Hi Jan,

Can you please attach the log file of the mongos that is doing the balancing?

Also just to confirm: Version 2.6.3 has not been released yet. Are you running 2.6.2?

Thanks,
Thomas
