[SERVER-5491] Configurable balancer delay parameter Created: 03/Apr/12 Updated: 06/Dec/22 Resolved: 17/Dec/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Greg Studer | Assignee: | [DO NOT USE] Backlog - Sharding EMEA |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | lamont-triage | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: | |
| Assigned Teams: | Sharding EMEA |
| Participants: | |
| Description |
|
delayMs in config.settings( { _id : balancer })? |
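A minimal sketch of what the proposed setting might have looked like in the mongo shell. Note that `delayMs` is hypothetical and was never implemented (the ticket was closed Won't Do); only the `config.settings` / `{ _id: "balancer" }` document is real.

```javascript
// Hypothetical: delayMs was proposed in this ticket but never implemented.
// Balancer settings live in the settings collection of the config database.
use config
db.settings.update(
    { _id: "balancer" },
    { $set: { delayMs: 500 } },  // hypothetical pause between migrations, in ms
    { upsert: true }
)
```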
| Comments |
| Comment by Alexander Komyagin [ 02/May/14 ] |
|
Another setup where this functionality would have helped. The FROM shard has 4 nodes: one primary and 3 secondaries, two of which are somewhat slower than the third. With secondaryThrottle enabled, the deletes are throttled only by the fastest secondary, eventually overloading the two slower secondaries with a delete rate they can't sustain. |
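For context, secondaryThrottle is controlled through the same settings collection; a sketch of how it is enabled (using the `_secondaryThrottle` field name as documented for this era of the server):

```javascript
// secondaryThrottle makes each migration delete wait for at least one
// secondary to acknowledge, which is why the fastest secondary sets the pace.
use config
db.settings.update(
    { _id: "balancer" },
    { $set: { _secondaryThrottle: true } },
    { upsert: true }
)
```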
| Comment by Kevin J. Rice [ 15/Mar/13 ] |
|
From a user perspective: when we have a constant, very high load, we can become unbalanced and then increasingly so. Once a shard has more chunks, it gets more activity, which generates more chunks, and so on (it's dynamically unstable). I'd suggest a radioactive-decay model where the longer the cluster stays unbalanced, the higher the priority placed on balancing versus writes. Aggressiveness could then be derived/tuned using heuristics from your MMS service's data. |
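The decay idea above isn't in the server, but its shape can be sketched; `halfLifeMinutes` below is an assumed tunable for illustration, not a real MongoDB parameter:

```javascript
// Sketch of a "radioactive-decay" priority model (hypothetical, not in MongoDB):
// aggressiveness rises from 0 toward 1 the longer the cluster stays unbalanced,
// with a half-life controlling how quickly balancing wins out over writes.
function balancerAggressiveness(minutesUnbalanced, halfLifeMinutes) {
    return 1 - Math.pow(0.5, minutesUnbalanced / halfLifeMinutes);
}

console.log(balancerAggressiveness(0, 30));   // 0      - just became unbalanced
console.log(balancerAggressiveness(30, 30));  // 0.5    - one half-life elapsed
console.log(balancerAggressiveness(120, 30)); // 0.9375 - four half-lives elapsed
```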
| Comment by Eliot Horowitz (Inactive) [ 04/Apr/12 ] |
|
I like the idea of a balancer aggressiveness metric; that way we can tune parameters based on it, but the parameters could change over time, etc... |
| Comment by Greg Studer [ 04/Apr/12 ] |
|
Agree that making things smarter would be helpful, but I still think a general "balancer aggressiveness" parameter is needed, because we have all kinds of customer apps that can tolerate more or less interruption. Any set of benchmarks we choose is going to have issues (for the same reason we don't publish benchmarks of our own: there are too many system-specific issues). |
| Comment by Eliot Horowitz (Inactive) [ 04/Apr/12 ] |
|
I'm pretty opposed to a delay parameter. We should figure out why this is needed and then address that, i.e. wait until queue sizes go back to normal, or replication catches up. |