Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 7.1.0-rc0, 7.0.1
Affects Version/s: None
Component/s: None
Labels:
- shardingemea-qw

Assigned Teams:

Sharding EMEA
Backwards Compatibility:
Fully Compatible
Backport Requested:

v7.0, v6.0, v5.0, v4.4
Sprint:
Sharding EMEA 2023-08-21
Story Points:
2
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

When the balancerShouldReturnRandomMigrations and overrideBalanceRoundInterval failpoints are used to induce chunk migrations, the balancer appears to become overzealous and starts too many migrations. Many of these migrations collide and get cancelled, which reduces our test coverage of chunk migrations.

Consider this example: we have a 2-shard cluster with 2 sharded collections and the above failpoints set. I noticed that 99% of chunk migrations were failing, due to the following pattern:
0. Balancer enqueues chunk migration requests to both shards.
1. Balancer sends request to Shard 0 to move chunk of nssA to Shard 1.
2. Balancer sends request to Shard 1 to move chunk of nssB to Shard 0.
3. Shard 0 receives the request. Prints "Starting chunk migration donation"
4. Shard 1 receives the request. Prints "Starting chunk migration donation"
5. Shard 0 rejects Shard 1's attempt to donate a chunk ("Rejecting receive chunk due to conflicting donate chunk in progress"), since it is trying to donate nssA's chunk to Shard 1.
6. Shard 1 rejects Shard 0's attempt to donate a chunk (same message printed) since it is already trying to donate nssB's chunk to Shard 0.
7. No chunk migrations occur, because the shards denied each other's migrations.
8. Repeat from the top.
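
The pattern above can be reproduced with a small toy model. This is only a sketch of the collision logic described in steps 0 to 7, not server code: the `Migration` type and `run_round` function are hypothetical names, and the model assumes (as in the steps) that every donor starts its donation before any recipient is contacted, and that a shard which is currently donating rejects any incoming chunk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Migration:
    """One chunk migration request (hypothetical model, not a server type)."""
    namespace: str
    donor: str
    recipient: str


def run_round(migrations):
    """Simulate one balancer round and return (succeeded, rejected)."""
    # Steps 3-4: every donor starts "chunk migration donation" first.
    donating = {m.donor for m in migrations}

    succeeded, rejected = [], []
    for m in migrations:
        # Steps 5-6: a recipient that is itself donating rejects the
        # incoming chunk ("conflicting donate chunk in progress").
        if m.recipient in donating:
            rejected.append(m)
        else:
            succeeded.append(m)
    return succeeded, rejected


# The 2-shard, 2-collection example from the description.
colliding_round = [
    Migration("nssA", donor="shard0", recipient="shard1"),
    Migration("nssB", donor="shard1", recipient="shard0"),
]
ok, failed = run_round(colliding_round)
print(len(ok), len(failed))  # prints "0 2"

# A round with a single migration has nothing to collide with.
single_ok, single_failed = run_round(colliding_round[:1])
print(len(single_ok), len(single_failed))  # prints "1 0"
```

In this model, both migrations of the colliding round are rejected on every round, which matches step 7: no chunk migrations occur.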

While the above situation, in which every chunk migration fails, is less likely in our passthroughs (because there are more collections, shards, etc.), I noticed that on some runs of multi_stmt_txn_jscore_passthrough_with_migration only half the chunk migrations were actually going through. See my comment for a patch run and more info.
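
One way to picture how such collisions could be avoided is to let each shard take part in at most one migration per round. The sketch below is purely illustrative and is not the change that resolved this ticket: `pick_non_conflicting` is a hypothetical helper, and candidates are modelled as plain `(donor, recipient, namespace)` tuples.

```python
import random


def pick_non_conflicting(candidates, rng):
    """Pick random migrations so that no shard appears in more than one.

    `candidates` is an iterable of (donor, recipient, namespace) tuples.
    Hypothetical helper for illustration only.
    """
    shuffled = list(candidates)
    rng.shuffle(shuffled)  # keep the selection random

    busy = set()  # shards already used as donor or recipient this round
    chosen = []
    for donor, recipient, namespace in shuffled:
        if donor in busy or recipient in busy:
            continue  # would collide with a migration already chosen
        chosen.append((donor, recipient, namespace))
        busy.update((donor, recipient))
    return chosen


candidates = [
    ("shard0", "shard1", "nssA"),
    ("shard1", "shard0", "nssB"),
]
chosen = pick_non_conflicting(candidates, random.Random(0))
print(len(chosen))  # prints "1"
```

With only two shards, at most one migration can be scheduled per round under this rule, but that one migration is not rejected by its recipient.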

Is duplicated by: SERVER-78915 [only affect testing] A shard may be involved in more than one migration per balancer round when using the "Random migrations" feature (Closed)

Assignee: Silvia Surroca
Reporter: Vishnu Kaushik
Participants: Githook User, Silvia Surroca, Vishnu Kaushik
Votes: 0
Watchers: 5

Created: Jun 28 2023 03:22:08 PM UTC
Updated: Oct 29 2023 09:19:28 PM UTC
Resolved: Aug 08 2023 07:25:37 AM UTC
Confidence Status Last Update: 07/Aug/23 7:36 AM
