[SERVER-5491] Configurable balancer delay parameter Created: 03/Apr/12  Updated: 06/Dec/22  Resolved: 17/Dec/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Greg Studer Assignee: [DO NOT USE] Backlog - Sharding EMEA
Resolution: Won't Do Votes: 0
Labels: lamont-triage
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-5490 Balancer delay doesn't really delay v... Closed
Assigned Teams:
Sharding EMEA
Participants:

 Description   

delayMs in config.settings({ _id : balancer })?



 Comments   
Comment by Alexander Komyagin [ 02/May/14 ]

Another setup where this functionality would have helped.

The FROM shard has 4 nodes: one primary and 3 secondaries. Two of the secondaries are somewhat slower than the third. With secondaryThrottle enabled, the deletes are throttled by the fastest secondary, which eventually overloads the 2 slower secondaries with a rate of deletes that they can't sustain.

Comment by Kevin J. Rice [ 15/Mar/13 ]

From a user perspective: when we have a constant very high load, we can get unbalanced and become more so. Once a shard has more chunks, it gets more activity, which generates more chunks, etc. (it's dynamically unstable).

I can suggest a radioactive-decay model where the longer it goes unbalanced the higher the priority placed on balancing vs. writes. Aggressiveness could then be derived/tuned using heuristics from your MMS service's data.
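One way to read the radioactive-decay suggestion above is as a half-life on the priority given to writes. The sketch below is only an illustration of that reading: the function name `balancing_priority` and the `half_life_s` parameter are hypothetical and do not correspond to any server setting.

```python
def balancing_priority(seconds_unbalanced: float, half_life_s: float = 600.0) -> float:
    """Return the share of priority (0.0 to 1.0) given to balancing over writes.

    Hypothetical model: the share given to writes decays like a radioactive
    sample, halving every `half_life_s` seconds the cluster stays unbalanced.
    """
    if seconds_unbalanced < 0:
        raise ValueError("seconds_unbalanced must be non-negative")
    if half_life_s <= 0:
        raise ValueError("half_life_s must be positive")
    write_share = 0.5 ** (seconds_unbalanced / half_life_s)
    return 1.0 - write_share
```

Under these assumptions a freshly unbalanced cluster gives balancing no extra priority, and each further half-life removes half of the remaining write share. The half-life itself is the value that would have to be tuned from observed data.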

Comment by Eliot Horowitz (Inactive) [ 04/Apr/12 ]

i like the idea of a balancer aggressive metric
0 to 10
no units

that way we can tune parameters based on it, but the parameters could change over time, etc...
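A minimal sketch of how a unitless 0 to 10 aggressiveness value might be translated into concrete knobs. Everything here is an assumption made for illustration: `BalancerTuning`, `tuning_for`, the 60-second maximum delay, and the cut-off at level 8 are all hypothetical. The point is only that the mapping lives in one place and can change over time without changing the meaning of the 0 to 10 scale.

```python
from dataclasses import dataclass

# Hypothetical upper bound on the pause between balancing rounds.
MAX_DELAY_MS = 60_000


@dataclass(frozen=True)
class BalancerTuning:
    """Concrete settings derived from the aggressiveness level (illustrative only)."""
    delay_ms: int
    secondary_throttle: bool


def tuning_for(aggressiveness: int) -> BalancerTuning:
    """Map a unitless level from 0 (gentlest) to 10 (most aggressive) onto settings."""
    if not 0 <= aggressiveness <= 10:
        raise ValueError("aggressiveness must be between 0 and 10")
    # The delay shrinks linearly from MAX_DELAY_MS at level 0 to 0 at level 10.
    delay_ms = MAX_DELAY_MS * (10 - aggressiveness) // 10
    # Assumed cut-off: only the most aggressive levels stop waiting on secondaries.
    return BalancerTuning(delay_ms=delay_ms, secondary_throttle=aggressiveness < 8)
```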

Comment by Greg Studer [ 04/Apr/12 ]

Agree that making things smarter would be helpful, but still think a general "balancer aggressiveness" parameter is needed, because we have all kinds of customer apps that can tolerate more-or-less interruption. Any set of benchmarks we choose is going to have issues (for the same reason that we don't publish benchmarks of our own: there are too many system-specific issues).

Comment by Eliot Horowitz (Inactive) [ 04/Apr/12 ]

I'm pretty opposed to a delay parameter.

We should figure out why this is needed and then address that.

i.e. wait until queue sizes go back to normal, or replication catches up.
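The condition-based alternative to a fixed delay could be sketched as a gate that is checked before each migration. This is an assumption-laden illustration, not server behavior: `should_start_migration`, `queue_slack`, and `max_lag_s` are hypothetical names and thresholds. The sketch requires both conditions to hold and measures lag against the slowest secondary, so that slower replica set members are not left behind.

```python
from typing import Sequence


def should_start_migration(
    queue_size: int,
    baseline_queue_size: int,
    secondary_lags_s: Sequence[float],
    queue_slack: float = 1.5,
    max_lag_s: float = 5.0,
) -> bool:
    """Return True when the shard looks healthy enough to take another migration.

    All thresholds are hypothetical. Lag is judged by the slowest secondary.
    """
    # "Normal" is assumed to mean within queue_slack times the baseline queue size.
    queues_normal = queue_size <= baseline_queue_size * queue_slack
    # With no secondaries reported, there is nothing to wait for.
    worst_lag_s = max(secondary_lags_s, default=0.0)
    replication_caught_up = worst_lag_s <= max_lag_s
    return queues_normal and replication_caught_up
```

For example, `should_start_migration(12, 10, [0.5, 1.0, 9.0])` returns `False` under these defaults, because one secondary is 9 seconds behind even though the queue is close to its baseline.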

Generated at Thu Feb 08 03:09:04 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.