[SERVER-78498] Make the balancer failpoint smarter Created: 28/Jun/23 Updated: 29/Oct/23 Resolved: 08/Aug/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 7.1.0-rc0, 7.0.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Vishnu Kaushik | Assignee: | Silvia Surroca |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | shardingemea-qw | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: |
Sharding EMEA
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Backport Requested: |
v7.0, v6.0, v5.0, v4.4
|
||||||||||||
| Sprint: | Sharding EMEA 2023-08-21 | ||||||||||||
| Story Points: | 2 | ||||||||||||
| Description |
|
When the balancerShouldReturnRandomMigrations and overrideBalanceRoundInterval failpoints are used to induce chunk migrations, the balancer seems to become overzealous and starts too many migrations. As a result, many of the migrations collide and get cancelled, which reduces our test coverage of chunk migrations.

Consider this example: a 2-shard cluster with 2 sharded collections and the above failpoints set. I noticed that 99% of chunk migrations were failing, due to the following pattern:

While the above situation of every chunk migration failing is less likely in our passthroughs (because they have more collections, more shards, etc.), I noticed that on some runs of multi_stmt_txn_jscore_passthrough_with_migration only half of the chunk migrations were actually going through. See my comment for a patch run and more info. |
| Comments |
| Comment by Githook User [ 17/Aug/23 ] | |||||||||||||
|
Author: Silvia Surroca <silvia.surroca@mongodb.com> (silviasuhu) Message: (cherry picked from commit c815f795ecfc9cc0cbabdf0e8ff67889541e8ed4) | |||||||||||||
| Comment by Githook User [ 07/Aug/23 ] | |||||||||||||
|
Author: Silvia Surroca <silvia.surroca@mongodb.com> (silviasuhu) Message: | |||||||||||||
| Comment by Silvia Surroca [ 13/Jul/23 ] | |||||||||||||
|
This happens because the shards involved in a chosen migration are not removed from the availableShards list, so the same shards can be picked again for another migration in the same round. We should fix this. | |||||||||||||
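A minimal sketch of the intended behavior, written in Python rather than the server's C++ and not taken from the balancer code. The function name select_random_migrations, the collection names, and the shard ids are hypothetical; only the idea of removing the chosen shards from the available set comes from the comment above.

```python
import random


def select_random_migrations(collections, shards, rng):
    """Pick at most one random migration per collection for a single round.

    Hypothetical illustration: each shard takes part in at most one
    migration per round, so the selected migrations cannot collide.
    """
    available = set(shards)
    migrations = []
    for coll in collections:
        # A migration needs a donor and a recipient that are both idle.
        if len(available) < 2:
            break
        # Sort first so the result is reproducible for a seeded rng.
        donor, recipient = rng.sample(sorted(available), 2)
        migrations.append((coll, donor, recipient))
        # The step described as missing: both shards are now busy
        # for the rest of this round.
        available.discard(donor)
        available.discard(recipient)
    return migrations
```

With 2 shards and 2 collections, this selection yields a single migration per round instead of two that compete for the same shards.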
| Comment by Vishnu Kaushik [ 28/Jun/23 ] | |||||||||||||
|
My local logs of the two-shard, two-collection scenario:
|