[SERVER-68777] BalancerCollectionStatus may report balancerCompliant too early Created: 12/Aug/22  Updated: 29/Sep/22  Resolved: 29/Sep/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Allison Easton Assignee: Silvia Surroca
Resolution: Won't Fix Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-67301 Balancer may perform one unnecessary ... Closed
is related to SERVER-69136 Tests should consider balancerCollect... Closed
Operating System: ALL
Sprint: Sharding EMEA 2022-09-05, Sharding EMEA 2022-09-19, Sharding EMEA 2022-10-03
Participants:
Linked BF Score: 11

 Description   

SERVER-67301 tried to fix the problem that the balancer may make wrong decisions based on data size, because deleting orphans and updating the orphan counter are not done in a single transaction. However, a related problem remains.

Suppose we have 8.1MB of documents and 1MB of orphans (a total of 9.1MB) on one shard and 5MB of documents on a second shard, with a max chunk size of 1MB.

If the orphans are deleted but the orphan counter has not yet been updated when the balancer asks for the data size, the first shard will report 8.1MB of data minus 1MB of orphans (7.1MB) and the second shard will report 5MB of data. The balancer will return balancerCompliant because the difference is 2.1MB, which is less than 3 * 1MB.

Then the orphan counter is updated, and in the next round the first shard reports 8.1MB of data (no orphans) and the second shard reports 5MB of data. The difference is now 3.1MB, so a chunk needs to be moved from the first shard to the second shard.
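
The arithmetic of the two rounds can be reproduced with a small sketch. This is only an illustration of the scenario described above, not the balancer's actual code: the function names are hypothetical, sizes are expressed in kB (taking 1MB as 1000kB) to keep the numbers exact, and the only rule assumed is the one stated in this ticket, that a collection is compliant when the difference is less than 3 * max chunk size.

```python
# Illustrative model of the scenario in this ticket (hypothetical names).
# Sizes are in kB, with 1MB taken as 1000kB, so the arithmetic stays exact.

MAX_CHUNK_SIZE_KB = 1000  # max chunk size of 1MB


def reported_size(data_on_disk_kb, orphan_counter_kb):
    """Size a shard reports: data currently on disk minus the orphan counter."""
    return data_on_disk_kb - orphan_counter_kb


def is_compliant(size_a_kb, size_b_kb, max_chunk_size_kb=MAX_CHUNK_SIZE_KB):
    """Compliant when the difference is less than 3 * max chunk size."""
    return abs(size_a_kb - size_b_kb) < 3 * max_chunk_size_kb


# Round 1: orphans already deleted (8.1MB on disk), counter still at 1MB.
shard1_stale = reported_size(8100, 1000)   # 7.1MB
shard2 = reported_size(5000, 0)            # 5MB
stale_round_compliant = is_compliant(shard1_stale, shard2)

# Round 2: the orphan counter has been updated to 0.
shard1_fresh = reported_size(8100, 0)      # 8.1MB
fresh_round_compliant = is_compliant(shard1_fresh, shard2)

print(stale_round_compliant, fresh_round_compliant)
```

With the stale counter the difference is 2.1MB and the collection looks compliant; once the counter catches up the difference is 3.1MB and a migration is needed.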

When resolving this ticket, remember to remove the extra check of the collection balance status in the testing function awaitCollectionBalance.



 Comments   
Comment by Silvia Surroca [ 29/Sep/22 ]

We've decided not to fix this issue since the solution would slow down the balancer.
In more detail, the proper solution would require the removal of a range deletion and the update of the orphan counter to execute atomically from the balancer's point of view, which would adversely affect balancer performance.

Generated at Thu Feb 08 06:11:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.