[SERVER-44172] Chunk balancer should account for total chunks per shard Created: 23/Oct/19  Updated: 04/Dec/19  Resolved: 04/Dec/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.12
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Aaron Westendorf Assignee: Carl Champain (Inactive)
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

We had an existing database cluster of 2 shards that backed a multi-tenant testing environment. That environment had the following:

 

dbs: 541
collections: 31560
chunks: 144372

That is, many of these databases contained over 50 collections, each representing a different supported feature, but each holding very little data. As a result, they were left with the default chunk count from the time they were created: 4 chunks, 2 on each of the shards.
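For reference, each of these collections would have been sharded with something along these lines (the database, collection, and key names here are illustrative; a hashed shard key on an empty collection creates two initial chunks per shard by default, which matches the counts above):

sh.enableSharding("tenant_0001")
sh.shardCollection("tenant_0001.feature_flags", { _id: "hashed" })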

As this test environment continued to grow, we found we needed to add a 3rd shard. We did, and that moved some chunks, but then the migrations stopped. At that point, the chunk distribution was:

{ "_id" : "shard-1", "count" : 67574 }
{ "_id" : "shard-2", "count" : 66715 }
{ "_id" : "shard-3", "count" : 10413 }

After confirming that the balancer itself is enabled and running without errors, we think we've isolated this to a missing heuristic in the balancer.

The balancer balances each collection individually, and we've observed that working properly for years. However, it does not take the total number of chunks per shard into account, so in a situation such as this it does not properly balance these smaller collections across the cluster.
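To illustrate, here is how the same config.chunks data can be viewed per collection (the namespace below is hypothetical). Because the balancer evaluates each collection's distribution in isolation, a collection whose few chunks are already spread within the per-collection migration threshold is left alone, no matter how skewed the per-shard totals are:

// chunk counts per shard for a single small collection (illustrative namespace)
db.getSiblingDB("config").chunks.aggregate([
    { $match: { ns: "tenant_0001.feature_flags" } },
    { $group: { _id: "$shard", count: { $sum: 1 } } }
])

Repeated across namespaces, this presumably shows many collections whose handful of chunks still sit entirely on shard-1 and shard-2, none of which individually looks imbalanced enough to trigger a migration.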



 Comments   
Comment by Carl Champain (Inactive) [ 04/Dec/19 ]

Hi aaron.westendorf,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Carl
 

Comment by Carl Champain (Inactive) [ 11/Nov/19 ]

Hi aaron.westendorf,

Any updates on this issue?

Comment by Carl Champain (Inactive) [ 29/Oct/19 ]

Hi aaron.westendorf,

Could you also provide a mongodump of the config server database? Please use the secure uploader in my first comment.

We suspect that the described issue is known behavior, since the balancer currently looks at only one collection at a time.

Thank you,
Carl

Comment by Carl Champain (Inactive) [ 25/Oct/19 ]

Hi aaron.westendorf,

Thanks for taking the time to submit this report.

A few questions to help us going forward:
1. Can you share the existing config server primary logs covering the timeframe of this behavior?
2. Can you share the config server primary logs with the "sharding" component verbosity at level 1? Use this setting for three minutes, then set the level back to what you were using (a combined example follows this list).

db.setLogLevel(1, "sharding")

3. Can you run sh.status() in mongos and also share the output?
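Putting 2 and 3 together, the full sequence would look something like this (the restore step assumes your previous verbosity was the default of 0; adjust if you normally run at a different level):

// on the config server primary: raise sharding verbosity for about three minutes
db.setLogLevel(1, "sharding")
// ...collect logs while the balancer runs...
db.setLogLevel(0, "sharding")   // restore your previous verbosity; 0 is the default

// from a mongos: capture the full sharding status to share
sh.status()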

Please upload your files to our secure uploader here. Only MongoDB engineers can view these files, and they will expire after a period of time.
 
Kind regards,
Carl
