[SERVER-10024] cluster can end up with large chunks that did not get split and will time out on migration Created: 25/Jun/13  Updated: 14/Apr/23  Resolved: 14/Apr/23

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Antoine Girbal Assignee: Pierlauro Sciarelli
Resolution: Done Votes: 1
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-44088 Autosplitter seems to ignore some fat... Closed
Assigned Teams:
Sharding EMEA
Sprint: Sharding EMEA 2023-04-17
Participants:

 Description   

Consider the case where:

  • a large volume of insertions
  • migration is slow due to slow hardware and many indices (e.g. 20)
  • consequently a moveChunk operation takes a long time (e.g. 1 min)
  • consequently any split fails during that time since the ns is locked, and chunks become larger
  • consequently chunks take even longer to move... This downward spiral makes things worse and worse
  • eventually chunks cannot be moved at all. The migration gets aborted after a few minutes and no progress is made, yet the system stays busy the whole time trying to migrate those documents (a metadata-inspection sketch follows this list).
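For anyone hitting this, a quick way to check whether chunks have already grown past the migratable size is to inspect the sharding metadata from a mongos. A minimal mongosh sketch (the config.chunks layout and the jumbo flag vary by version, and "mydb.mycoll" is a placeholder):

    // Run against a mongos; config.chunks is keyed by "ns" before 5.0
    // and by the collection "uuid" from 5.0 onwards.
    const configDB = db.getSiblingDB("config");

    // Chunks the balancer has already flagged as too large to migrate.
    configDB.chunks.countDocuments({ jumbo: true });

    // Chunk count per shard for a hypothetical collection (pre-5.0 layout).
    configDB.chunks.aggregate([
      { $match: { ns: "mydb.mycoll" } },
      { $group: { _id: "$shard", nChunks: { $sum: 1 } } }
    ]);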

I think we need several server improvements:

A. Any chunk migration that aborts due to a timeout should result in a split. If anything, the split won't hurt. Right now the split seems to happen only in one specific case.

B. Ideally the migration process would avoid retrying the same chunk over and over. It may need some amount of randomization when choosing candidate chunks.

C. When mongos fails to split because the namespace is locked, it should mark the metadata as "needs split" for later. Ideally all "needs split" markers should be cleared before the next migration is attempted.

This is all to avoid the catch-22 problem where large chunks end up clogging the whole system.
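
Until something along the lines of A-C exists server-side, the usual workaround is to split the oversized chunk manually once the failing migration backs off. A hedged mongosh sketch (collection name and shard-key values are placeholders):

    // Split the chunk containing a given shard-key value at a server-chosen median.
    sh.splitFind("mydb.mycoll", { userId: 42 });

    // Or split at an explicit shard-key value if the hot range is known.
    sh.splitAt("mydb.mycoll", { userId: 100000 });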



 Comments   
Comment by Pierlauro Sciarelli [ 14/Apr/23 ]

Closing this ticket as gone away because:

  • The description refers to a deprecated auto-splitter behavior only present in versions that reached EOL long ago, when the component lived on the routers.
  • The described problem is surely not present in currently supported versions because the chunk splitter was moved to the shards in v4.2, which is currently the lowest supported version (not for long, as it goes EOL this month).

As a side note, the auto-splitter went away starting from v6.0, so the pre-splitting solution proposed by Nic is not viable anymore. Quoting the 6.0 release notes:

Starting in MongoDB 6.0.3, data in sharded clusters is distributed based on data size rather than number of chunks. As a result, you should be aware of the following significant changes in sharded cluster data distribution behavior:

  • The balancer distributes ranges of data rather than chunks. The balancing policy looks for evenness of data distribution rather than chunk distribution.
  • Chunks are not subject to auto-splitting. Instead, chunks are split only when moved across shards.
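
For completeness, on the versions that replaced the auto-splitter, the closest tuning knob is per-collection balancing configuration. A sketch, assuming a collection named "mydb.mycoll" and illustrative parameter values:

    // Run against a mongos with administrative privileges (MongoDB 6.0+).
    db.adminCommand({
      configureCollectionBalancing: "mydb.mycoll",
      chunkSize: 128,              // target range size in MiB
      defragmentCollection: true   // merge small leftover ranges
    });
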
Comment by Nic Cottrell [ 27/Jan/20 ]

Regarding problems with slow migration during special insert workloads, one good solution is to manually split chunks prior to starting the workload and let the balancer rebalance the new, empty chunks. When you start the import workload, the inserts should then be distributed across all available shards rather than all being sent to a single "hot shard" and then migrated in a subsequent step.
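
A minimal mongosh sketch of that pre-splitting approach, assuming a collection "mydb.imports" sharded on a numeric key "userId" and operator-chosen split points:

    // Create empty chunks up front so the balancer can spread them before the import.
    sh.enableSharding("mydb");
    sh.shardCollection("mydb.imports", { userId: 1 });

    // Hypothetical split points; pick values matching the expected key distribution.
    for (let point = 100000; point <= 900000; point += 100000) {
      sh.splitAt("mydb.imports", { userId: point });
    }
    // Wait for the balancer to distribute the empty chunks, then start the import.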

Comment by Kevin J. Rice [ 06/Aug/13 ]

I'm seeing this behaviour when doing a mongorestore of a large database. I end up with a bunch of unbalanced shards (we have 48 shards) that are not splitting because mongorestore is eating all the IO. So the balancer isn't going, the splitter sometimes fails, etc. I end up having to stop the mongorestore, restart the daemons/mongos processes, run stopBalancer()/startBalancer()/setBalancerState(false), wait 2 minutes, setBalancerState(true), etc., etc., until the balancer decides to start working, then wait for it to balance, then start the mongorestore again with a properly split and balanced set of data.
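
For reference, the balancer dance described above maps onto these shell helpers (a sketch; the waiting between steps is up to the operator):

    sh.stopBalancer();        // equivalent to sh.setBalancerState(false)
    sh.isBalancerRunning();   // poll until no balancing round is in flight
    sh.getBalancerState();    // confirm the on/off setting
    sh.startBalancer();       // re-enable once the restore is paused or finished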
