[SERVER-52886] Sharded cluster is balancing a lot and creating a lot of very small chunks Created: 16/Nov/20  Updated: 05/Aug/21  Resolved: 15/Jul/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.4.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Henri-Maxime Ducoulombier Assignee: Eric Sedor
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-55028 Improve the auto-splitter policy Closed
Operating System: ALL
Participants:

 Description   

We recently (mid-October) migrated our production sharded cluster to 4.4.1 and we are experiencing strange behavior from the balancer.

When we started the new cluster, we calculated the number of initial chunks using the default 64 MB chunk size, and we restored the collection after a pre-split operation.

So we started with 2700 chunks on a hashed shard key, for a collection of 170 GB and 16M+ documents. The average document size in the collection is 10 KiB and is consistent across the shards.
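
For context, the initial setup was along the lines of the following shell command (database, collection and key names here are placeholders, not our real ones):

 // Hashed shard key with ~2700 pre-split chunks (roughly 170 GB / 64 MB).
 sh.shardCollection(
     "mydb.events",                 // placeholder namespace
     { _id: "hashed" },             // placeholder hashed shard key
     false,                         // unique: not supported with hashed keys
     { numInitialChunks: 2700 }     // pre-split into the calculated chunk count
 );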

In our dev/test environment, here is the detail for one shard:

Shard shard-local-rs05 at XXXX
 data : 28.19GiB docs : 2766359 chunks : 450
 estimated data per chunk : 64.16MiB
 estimated docs per chunk : 6147

The problem is that, in production, the balancer has been running like crazy and we now have almost 60K very small chunks.

 data : 33.16GiB docs : 3324684 chunks : 11908
 estimated data per chunk : 2.85MiB
 estimated docs per chunk : 279
 
Totals
 data : 170.85GiB docs : 17131823 chunks : 59542
 Shard shard-04 contains 20.04% data, 20.01% docs in cluster, avg obj size on shard : 10KiB
 Shard shard-01 contains 20.25% data, 20.28% docs in cluster, avg obj size on shard : 10KiB
 Shard shard-02 contains 20.51% data, 20.51% docs in cluster, avg obj size on shard : 10KiB
 Shard shard-03 contains 19.77% data, 19.77% docs in cluster, avg obj size on shard : 10KiB
 Shard shard-05 contains 19.41% data, 19.4% docs in cluster, avg obj size on shard : 10KiB
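
As a quick sanity check, the per-chunk averages above can be recomputed from the raw figures (plain arithmetic, runnable in the mongo shell):

 var GiB = 1024 * 1024 * 1024, MiB = 1024 * 1024;
 print((28.19 * GiB / 450 / MiB).toFixed(2) + " MiB/chunk on dev/test");     // ~64 MiB, as expected
 print((33.16 * GiB / 11908 / MiB).toFixed(2) + " MiB/chunk in production"); // ~2.85 MiB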

Please note that we did NOT change the chunk size, and that the config database does not contain any chunksize document.
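
For reference, a check along these lines shows whether a custom chunk size has been set (a generic query against the config database, via any mongos); no result means the 64 MB default applies:

 // A custom chunk size would appear as { _id: "chunksize", value: <sizeInMB> }.
 db.getSiblingDB("config").settings.find({ _id: "chunksize" })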

Can anybody please advise on what we should do and where to investigate?

I'm afraid this could lead to performance issues or, worse, unavailability if there are hard limits on chunk count or size.



 Comments   
Comment by Pierlauro Sciarelli [ 05/Aug/21 ]

Hi hmducoulombier@marketing1by1.com, thank you for describing the use case in such detail.

From your description, it looks like chunks were always being split in an unbalanced way: a chunk of size maxChunkSize / 2 + ε was split into two chunks, left (a bit less than maxChunkSize / 2) and right (containing very few documents, potentially just one). As a consequence, even a single insert into the left chunk would trigger another split, and the same goes for subsequent inserts into the new left chunk, and so on. This way, it's easy to end up in the pathological situation you described.

This behavior is caused by the two factors listed below (both to be addressed under SERVER-55028). Keep in mind that the chunk splitter doesn't usually get triggered on every insert, but this can potentially happen if the CPU is fairly idle, so I am describing the very worst case.

1) The policy to choose split points (not user-configurable, part of splitVector)

Split points are chosen every N keys, with N being the number of keys estimated to correspond to maxChunkSize / 2. This distributes data fairly when a chunk is split into many smaller chunks at once (e.g. after re-enabling the balancer), but it is a very bad choice when chunks are continuously split in two (e.g. if the balancer has always been enabled).

2) The hashed shard key

Let's say that N is 100.

  • (OK case) Given a (non-hashed) monotonically increasing shard key, we would split every 101 inserts into two chunks, left and right, containing 100 and 1 documents respectively. New inserts always land in the right chunk, so it is always the next one to be split, and always at size N.
  • (Not OK case) Given a hashed shard key, new inserts do not always go into the right (last) chunk, so intermediate chunks keep being split into two chunks of 100 and 1 documents. The number of chunks ends up much higher than in the non-hashed case (see the sketch after this list).
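
To illustrate the difference, here is a toy simulation (plain shell JavaScript, not the actual splitter code; it assumes the worst case above, i.e. every insert can trigger a split check and a chunk splits as soon as it exceeds N documents):

 // Each chunk is modelled as just its document count; splitting a chunk that
 // exceeds N docs yields a left chunk of N docs and a right chunk with the rest.
 var N = 100;
 function simulate(pickChunk, inserts) {
     var chunks = [0];                        // start with a single empty chunk
     for (var i = 0; i < inserts; i++) {
         var idx = pickChunk(chunks);         // where does this insert land?
         chunks[idx] += 1;
         if (chunks[idx] > N) {               // "split every N keys"
             var rest = chunks[idx] - N;
             chunks[idx] = N;
             chunks.splice(idx + 1, 0, rest);
         }
     }
     return chunks;
 }
 // Monotonic (non-hashed) key: inserts always hit the last chunk.
 var monotonic = simulate(function (c) { return c.length - 1; }, 100000);
 // Hashed key: inserts land in a pseudo-random chunk.
 var hashed = simulate(function (c) { return Math.floor(Math.random() * c.length); }, 100000);
 print("monotonic: " + monotonic.length + " chunks, ~" + Math.round(100000 / monotonic.length) + " docs/chunk");
 print("hashed:    " + hashed.length + " chunks, ~" + Math.round(100000 / hashed.length) + " docs/chunk");

Under this model the monotonic key ends up with roughly N documents per chunk, while the hashed key produces many more, and therefore much smaller, chunks.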

PS: I am not in any way encouraging the use of non-hashed, monotonically increasing shard keys, as they may affect scalability. This problem is going to be solved soon, and in any case it is very rare to end up in the pathological situation described.

Comment by Eric Sedor [ 15/Jul/21 ]

Hi hmducoulombier@marketing1by1.com, I wanted to get back to you to let you know I'm going to close this ticket as a duplicate of SERVER-55028. As always, thanks for your patience.

A high count of small chunks can be undesirable if the balancer is having trouble keeping up with chunk migrations, but by itself it is not usually a cause for concern. That said, it does make sense for us to ensure that the system attempts to make use of up to the maximum chunk size.

In SERVER-55028, we're starting to consider broad improvements to chunk split logic that will focus on that goal.

Be well!

Comment by Eric Sedor [ 15/Jan/21 ]

Hi hmducoulombier@marketing1by1.com, and happy new year. I wanted to check whether you had seen my last message and could provide that context. Thank you!

Comment by Henri-Maxime Ducoulombier [ 24/Nov/20 ]

Hello Eric,

Thank you for your reply.

I uploaded several files using the secure portal and the curl command:

  • A mongod config server log (the primary) -> mongod-support1.log
  • A mongod shard server log (the 4th shard's primary mongod)
  • The output of sh.status()
  • The output of getShardDistribution() on the main collection having the issue
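
For reference, the last two outputs were produced with commands equivalent to the following (database and collection names here are placeholders):

 sh.status()
 db.getSiblingDB("mydb").events.getShardDistribution()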

Please note, regarding the sh.status() result, that another db in the same cluster has the same issue.

Our mongod logs are all recorded and uploaded to CloudWatch, and I can run queries to search for a given pattern if needed.

Henri-Maxime

Comment by Eric Sedor [ 24/Nov/20 ]

Hi Henri-Maxime Ducoulombier,

In general, the best place to start narrowing down an issue about sharding behavior is the MongoDB Developer Community Forums. While a specific bug or needed feature could be involved in any given issue, it often helps to walk through the expected behavior of a sharded cluster, which can vary widely by use case.

That said, 279 docs per chunk does look low to me. Can you upload to this secure upload portal the output of sh.status() and the mongod logs from a specific shard?

I'll take a look to see if anything is obviously amiss.

Gratefully,
Eric
