[SERVER-37810] Optimise balancer performance with zone sharding Created: 30/Oct/18 Updated: 29/Sep/23 Resolved: 28/Sep/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Josef Ahmad | Assignee: | Tommaso Tocci |
| Resolution: | Duplicate | Votes: | 4 |
| Labels: | ShardingRoughEdges, balancer-round-perf, high-value, shardingemea-qw | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Assigned Teams: | Sharding EMEA |
||||||||||||||||
| Participants: | |||||||||||||||||
| Case: | (copied to CRM) | ||||||||||||||||
| Story Points: | 2 | ||||||||||||||||
| Description |
|
Reproduced in MongoDB 3.4.16 and 4.0.3. With a large number of chunks (1+ million), the balancer is observed to spend a large amount of time checking whether each chunk belongs to a tag. As a result, a balancer round can spend most of its time finding a candidate chunk (e.g. one minute) rather than migrating it, which can significantly reduce overall cluster balancing performance. Below is a repro where the balancer spends 90% of its time finding a candidate chunk, and only 10% of its time moving the chunk. Off-CPU profiling suggests that the balancer thread is CPU-bound. Attached is a 60-second flame graph of the 3.4.16 CSRS primary process. The CSRS primary is only balancing the cluster at that time.
Most CPU time is consumed in BSONObj::woCompare(). |
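To illustrate why per-chunk tag checks can dominate a balancer round at this scale, here is a minimal sketch in Python. It is not the balancer's actual code: shard keys are modelled as plain integers instead of BSON objects, and `ZONES`, `zone_for_chunk_linear`, `build_zone_index` and `zone_for_chunk_indexed` are hypothetical names chosen for illustration. It contrasts a linear scan over all zone ranges for every chunk with a binary search over zone ranges sorted once up front, which needs far fewer key comparisons per chunk.

```python
import bisect

# Hypothetical model: shard keys are plain integers, and zones are
# non-overlapping half-open ranges [lo, hi) mapped to a tag name.
ZONES = [(0, 100, "EU"), (100, 200, "US"), (300, 400, "APAC")]


def zone_for_chunk_linear(chunk_min, chunk_max, zones):
    """Scan the zones in order; the number of key comparisons per chunk
    grows with the number of zones."""
    for lo, hi, tag in zones:
        if lo <= chunk_min and chunk_max <= hi:
            return tag
    return None


def build_zone_index(zones):
    """Sort zones by lower bound once so each lookup can use binary search."""
    ordered = sorted(zones)
    lower_bounds = [lo for lo, _, _ in ordered]
    return lower_bounds, ordered


def zone_for_chunk_indexed(chunk_min, chunk_max, index):
    """Find the last zone starting at or before chunk_min, then check that
    the chunk fits entirely inside it."""
    lower_bounds, ordered = index
    pos = bisect.bisect_right(lower_bounds, chunk_min) - 1
    if pos < 0:
        return None  # chunk starts before the first zone
    lo, hi, tag = ordered[pos]
    return tag if chunk_max <= hi else None
```

Both lookups return the same tag for a given chunk. The indexed version does a logarithmic number of comparisons per chunk instead of one comparison pass per zone, which matters when each comparison is as costly as a full key comparison and there are over a million chunks.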
| Comments |
| Comment by Garaudy Etienne [ 29/Sep/23 ] |
|
To be explicitly clear: This issue is the same as |
| Comment by Matt Panton [ 28/Sep/23 ] |
|
Balancer performance with zone sharding has improved due to the following enhancements - With most of the poor performance due to |
| Comment by Kaloian Manassiev [ 30/Oct/18 ] |
|
Thank you josef.ahmad for the detailed report and for the heat map! Looking at it, this is effectively the same issue as
|