Balancer does not make progress when the most loaded shard is already balanced within its zones


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: 6.0.3, 7.0.0, 8.0.0, 8.2.0
    • Component/s: None
    • None
    • Catalog and Routing
    • ALL

      Shard1 [Zone_US] 500 GB
      Shard2 [Zone_EU] 300 GB
      Shard3 [Zone_EU] 100 GB

      In this scenario, the balancer fails to balance Zone_EU and does not move any chunks. Instead, it should move 100 GB from Shard2 to Shard3.

    • CAR Team 2026-01-05

      Summary

      The balancer can stop making progress when the most loaded shard belongs to a zone that is already balanced. It keeps selecting that shard as the donor even though every shard in its zone is already balanced, and then fails to find a suitable recipient because the remaining underloaded shards belong to other zones.

      Details

      When the cluster has zones configured and the most overloaded shard (by data size) is in a zone that is already internally balanced, the balancer repeatedly tries to move chunks from that shard.

      However, since the other shards in the same zone are already balanced, there are no valid chunk candidates that can be donated while still honoring the existing zone configuration. As a result:

      • The balancer keeps choosing the most loaded shard as the donor
      • No migrations are actually performed, so the overall balancing does not make progress
      • Zones themselves are respected at all times; the issue is with donor selection and progress when the top candidate shard cannot actually donate any chunks
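The behavior described above can be illustrated with a minimal sketch. This is a simplified model, not the server's implementation: the `SHARDS` mapping, the `naive_round` function, and the `threshold_gb` parameter are hypothetical names, each shard is assumed to belong to exactly one zone, and the sizes come from the example in this ticket.

```python
# Hypothetical model: shard name -> (zone, data size in GB).
# Values are taken from the example scenario in this ticket.
SHARDS = {
    "Shard1": ("Zone_US", 500),
    "Shard2": ("Zone_EU", 300),
    "Shard3": ("Zone_EU", 100),
}


def naive_round(shards, threshold_gb=1):
    """Consider only the globally most loaded shard as donor.

    Returns a (donor, recipient) pair, or None if the round ends
    without a migration. threshold_gb is an assumed imbalance
    threshold, not the server's real value.
    """
    donor = max(shards, key=lambda name: shards[name][1])
    donor_zone, donor_size = shards[donor]

    # Zone constraint: a recipient must be in the donor's zone and
    # must hold noticeably less data than the donor.
    candidates = [
        name
        for name, (zone, size) in shards.items()
        if name != donor
        and zone == donor_zone
        and donor_size - size > threshold_gb
    ]
    if not candidates:
        # No other donor is tried, so the round makes no progress.
        return None
    recipient = min(candidates, key=lambda name: shards[name][1])
    return donor, recipient


print(naive_round(SHARDS))  # None
```

In this model Shard1 is picked every round, has no peer in Zone_US to receive data, and the imbalance between Shard2 and Shard3 is never examined.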

      Impact

      Balancer rounds can appear to be “stuck” or not making progress, even though the system is correctly enforcing the configured zones.
      This mainly affects situations where:

      • One shard is globally the most loaded shard
      • That shard is in a zone that is already locally balanced
      • Other zones may remain unbalanced
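As a rough aid for recognizing an affected cluster, the three conditions above can be expressed as a predicate. This is only a sketch under stated assumptions: `zone_spreads`, `is_affected`, and `threshold_gb` are hypothetical names, each shard is modeled as belonging to a single zone, and a zone is treated as "balanced" when the gap between its largest and smallest shard is within the assumed threshold.

```python
from collections import defaultdict


def zone_spreads(shards):
    """Map each zone to (largest shard size - smallest shard size) in GB."""
    sizes = defaultdict(list)
    for zone, size in shards.values():
        sizes[zone].append(size)
    return {zone: max(values) - min(values) for zone, values in sizes.items()}


def is_affected(shards, threshold_gb=1):
    """True if the globally most loaded shard sits in a balanced zone
    while at least one other zone is still unbalanced.

    threshold_gb is an assumed value for illustration only.
    """
    spreads = zone_spreads(shards)
    top_shard = max(shards, key=lambda name: shards[name][1])
    top_zone = shards[top_shard][0]
    top_zone_balanced = spreads[top_zone] <= threshold_gb
    other_zone_unbalanced = any(
        spread > threshold_gb
        for zone, spread in spreads.items()
        if zone != top_zone
    )
    return top_zone_balanced and other_zone_unbalanced


# Example scenario from this ticket: shard name -> (zone, size in GB).
example = {
    "Shard1": ("Zone_US", 500),
    "Shard2": ("Zone_EU", 300),
    "Shard3": ("Zone_EU", 100),
}
print(is_affected(example))  # True
```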

      Expected Behavior

      If the most loaded shard in a zone cannot donate any further chunks without violating zone constraints, the balancer should:

      • Skip it as a donor candidate for that round, and
      • Consider other shards/zones where valid migrations would still respect the zone configuration and effectively reduce imbalance.
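One possible shape for the expected behavior is sketched below. It is not the actual fix: `pick_migration` and `threshold_gb` are hypothetical names, each shard is assumed to belong to one zone, and the amount to move is assumed to be half the gap between donor and recipient.

```python
def pick_migration(shards, threshold_gb=1):
    """Try donors from most to least loaded and skip any donor that
    has no valid recipient in its own zone.

    Returns (donor, recipient, gb_to_move), or None if every zone is
    balanced within the assumed threshold.
    """
    by_load = sorted(shards, key=lambda name: shards[name][1], reverse=True)
    for donor in by_load:
        donor_zone, donor_size = shards[donor]
        candidates = [
            name
            for name, (zone, size) in shards.items()
            if name != donor
            and zone == donor_zone
            and donor_size - size > threshold_gb
        ]
        if not candidates:
            continue  # skip this donor for the round, try the next one
        recipient = min(candidates, key=lambda name: shards[name][1])
        # Assumption: moving half the gap equalizes the two shards.
        gb_to_move = (donor_size - shards[recipient][1]) / 2
        return donor, recipient, gb_to_move
    return None


# Example scenario from this ticket: shard name -> (zone, size in GB).
cluster = {
    "Shard1": ("Zone_US", 500),
    "Shard2": ("Zone_EU", 300),
    "Shard3": ("Zone_EU", 100),
}
print(pick_migration(cluster))  # ('Shard2', 'Shard3', 100.0)
```

With the example data, Shard1 is skipped because Zone_US has no other shard, and the sketch proposes moving 100 GB from Shard2 to Shard3, which matches the outcome the ticket asks for.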

            Assignee:
            Pierlauro Sciarelli
            Reporter:
            Tommaso Tocci
            Votes:
            0
            Watchers:
            4

              Created:
              Updated: