[SERVER-78498] Make the balancer failpoint smarter Created: 28/Jun/23  Updated: 29/Oct/23  Resolved: 08/Aug/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0, 7.0.1

Type: Bug Priority: Major - P3
Reporter: Vishnu Kaushik Assignee: Silvia Surroca
Resolution: Fixed Votes: 0
Labels: shardingemea-qw
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Duplicate
is duplicated by SERVER-78915 [only affect testing] A shard may be ... Closed
Assigned Teams:
Sharding EMEA
Backwards Compatibility: Fully Compatible
Backport Requested:
v7.0, v6.0, v5.0, v4.4
Sprint: Sharding EMEA 2023-08-21
Participants:
Story Points: 2

 Description   

When using the balancerShouldReturnRandomMigrations and overrideBalanceRoundInterval failpoints to induce chunk migrations, it seems that the balancer becomes overzealous and starts too many migrations at once. Many of these migrations then collide and get cancelled, which reduces our test coverage of chunk migrations.

Consider this example: a 2-shard cluster with 2 sharded collections and the above failpoints set. I noticed that 99% of chunk migrations were failing, due to the following pattern (a rough sketch of the mutual rejection follows the list):
0. Balancer enqueues chunk migration requests to both shards.
1. Balancer sends request to Shard 0 to move chunk of nssA to Shard 1.
2. Balancer sends request to Shard 1 to move chunk of nssB to Shard 0.
3. Shard 0 receives the request. Prints "Starting chunk migration donation"
4. Shard 1 receives the request. Prints "Starting chunk migration donation"
5. Shard 0 rejects Shard 1's attempt to donate a chunk ("Rejecting receive chunk due to conflicting donate chunk in progress"), since it is trying to donate nssA's chunk to Shard 1.
6. Shard 1 rejects Shard 0's attempt to donate a chunk (same message printed) since it is already trying to donate nssB's chunk to Shard 0.
7. No chunk migrations occur, because the shards denied each other's migrations.
8. Repeat from the top.
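
To make the mutual rejection in steps 5-6 concrete, here is a minimal, self-contained C++ sketch of the per-shard conflict check that produces the "Rejecting receive chunk due to conflicting donate chunk in progress" error. The names (ShardState, tryStartDonate, tryStartReceive) are illustrative only and do not match the real ActiveMigrationsRegistry code; this is a simplification of the behavior described above, not the server implementation.

// Hypothetical simplification of the per-shard migration conflict check.
#include <iostream>
#include <optional>
#include <string>

struct ShardState {
    std::optional<std::string> activeDonationNss;  // namespace currently being donated, if any

    bool tryStartDonate(const std::string& nss) {
        if (activeDonationNss)
            return false;  // already donating another chunk
        activeDonationNss = nss;
        return true;
    }

    bool tryStartReceive(const std::string& nss) const {
        // A shard that is already donating refuses to receive: this is the mutual rejection.
        if (activeDonationNss) {
            std::cout << "Rejecting receive chunk for " << nss
                      << " due to conflicting donate chunk in progress ("
                      << *activeDonationNss << ")\n";
            return false;
        }
        return true;
    }
};

int main() {
    ShardState shard0, shard1;

    // Balancer starts both donations in the same round.
    shard0.tryStartDonate("test.nssA");  // Shard 0 -> Shard 1
    shard1.tryStartDonate("test.nssB");  // Shard 1 -> Shard 0

    // Each intended recipient is already a donor, so both receives are rejected
    // and neither migration makes progress in this round.
    shard1.tryStartReceive("test.nssA");
    shard0.tryStartReceive("test.nssB");
}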

While the above situation of every chunk migration failing is less likely in our passthroughs (which have more collections, shards, etc.), I noticed that on some runs of multi_stmt_txn_jscore_passthrough_with_migration only half the chunk migrations were actually going through. See my comment for a patch run and more info.



 Comments   
Comment by Githook User [ 17/Aug/23 ]

Author:

{'name': 'Silvia Surroca', 'email': 'silvia.surroca@mongodb.com', 'username': 'silviasuhu'}

Message: SERVER-78498 A shard may be involved in multiple migrations for testing

(cherry picked from commit c815f795ecfc9cc0cbabdf0e8ff67889541e8ed4)
Branch: v7.0
https://github.com/mongodb/mongo/commit/eb92d743b8f915b71db2e9f3f5ca78a904385727

Comment by Githook User [ 07/Aug/23 ]

Author:

{'name': 'Silvia Surroca', 'email': 'silvia.surroca@mongodb.com', 'username': 'silviasuhu'}

Message: SERVER-78498 A shard may be involved in multiple migrations for testing
Branch: master
https://github.com/mongodb/mongo/commit/c815f795ecfc9cc0cbabdf0e8ff67889541e8ed4

Comment by Silvia Surroca [ 13/Jul/23 ]

That is happening because the shards involved in a chosen migration are not removed from the availableShards list. We should fix this.

Currently, with the balancerShouldReturnRandomMigrations failpoint we schedule one migration per collection regardless of the number of shards, and we assign random shards to each migration. So the more collections there are, the more likely migrations are to conflict with each other.
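
A minimal sketch of the intended fix, assuming a simplified per-round selection loop: once a random donor/recipient pair is chosen for a collection, both shards are removed from availableShards so that later collections in the same round cannot schedule a conflicting migration on them. The names (Migration, pickRandomMigration, availableShards as a plain vector) are illustrative and not the actual balancer policy code.

// Illustrative sketch: remove chosen shards from the available pool per round.
#include <algorithm>
#include <iostream>
#include <optional>
#include <random>
#include <string>
#include <vector>

struct Migration {
    std::string nss, fromShard, toShard;
};

// Picks a random donor/recipient pair from the shards still available this round.
std::optional<Migration> pickRandomMigration(const std::string& nss,
                                             std::vector<std::string>& availableShards,
                                             std::mt19937& gen) {
    if (availableShards.size() < 2)
        return std::nullopt;  // not enough free shards left this round
    std::shuffle(availableShards.begin(), availableShards.end(), gen);
    Migration m{nss, availableShards[0], availableShards[1]};
    // The fix: both shards leave the pool, so later collections in the same
    // round cannot pick them and trigger the mutual-rejection pattern.
    availableShards.erase(availableShards.begin(), availableShards.begin() + 2);
    return m;
}

int main() {
    std::mt19937 gen{std::random_device{}()};
    std::vector<std::string> availableShards{"shard-rs0", "shard-rs1"};

    // With two shards, only the first collection gets a migration this round;
    // the second finds no free shards instead of colliding with the first.
    for (const auto& nss : {"test.basic3", "test.basic4"}) {
        if (auto m = pickRandomMigration(nss, availableShards, gen))
            std::cout << m->nss << ": " << m->fromShard << " -> " << m->toShard << "\n";
        else
            std::cout << nss << ": skipped, no available shards this round\n";
    }
}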

Comment by Vishnu Kaushik [ 28/Jun/23 ]

My local logs of the two shard two collection scenario:

[j0:c:prim] | 2023-06-14T22:06:02.023+00:00 D1 SHARDING 21860   [Balancer] "Start balancing round","attr":{"waitForDelete":false,"secondaryThrottle":{}}
...
[j0:c:prim] | 2023-06-14T22:06:07.020+00:00 D2 SHARDING 5847202 [Balancer] "Enqueuing new Balancer command request","attr":{"reqId":{"uuid":{"$uuid":"720c5b41-592a-4ee3-889d-d1ecd7003687"}},"command":"{ _shardsvrMoveRange: \"test.basic3\", toShard: \"shard-rs1\", min: { _id: MinKey }, max: { _id: 0.0 }, waitForDelete: false, epoch: ObjectId('648a396998675a1d5dee0f26'), fromShard: \"shard-rs0\", maxChunkSizeBytes: 134217728, forceJumbo: 0, secondaryThrottle: false, writeConcern: { w: 1, wtimeout: 0 } }","recoveryDocRequired":false}
...
[j0:c:prim] | 2023-06-14T22:06:07.020+00:00 D2 SHARDING 5847202 [Balancer] "Enqueuing new Balancer command request","attr":{"reqId":{"uuid":{"$uuid":"d5835e9f-41e0-4ef1-943e-47ef7c2f7354"}},"command":"{ _shardsvrMoveRange: \"test.basic4\", toShard: \"shard-rs0\", min: { _id: MinKey }, max: { _id: 0.0 }, waitForDelete: false, epoch: ObjectId('648a396898675a1d5dee0ead'), fromShard: \"shard-rs1\", maxChunkSizeBytes: 134217728, forceJumbo: 0, secondaryThrottle: false, writeConcern: { w: 1, wtimeout: 0 } }","recoveryDocRequired":false}
...
[j0:s1:prim] | 2023-06-14T22:06:07.021+00:00 I  MIGRATE  22016   [MoveChunk] "Starting chunk migration donation","attr":{"requestParameters":{"_shardsvrMoveRange":"test.basic4","toShard":"shard-rs0","min":{"_id":{"$minKey":1}},"max":{"_id":0},"waitForDelete":false,"epoch":{"$oid":"648a396898675a1d5dee0ead"},"fromShard":"shard-rs1","maxChunkSizeBytes":134217728,"forceJumbo":0,"secondaryThrottle":false},"collectionEpoch":{"$oid":"648a396898675a1d5dee0ead"}}
...
[j0:s0:prim] | 2023-06-14T22:06:07.021+00:00 I  MIGRATE  22016   [MoveChunk] "Starting chunk migration donation","attr":{"requestParameters":{"_shardsvrMoveRange":"test.basic3","toShard":"shard-rs1","min":{"_id":{"$minKey":1}},"max":{"_id":0},"waitForDelete":false,"epoch":{"$oid":"648a396998675a1d5dee0f26"},"fromShard":"shard-rs0","maxChunkSizeBytes":134217728,"forceJumbo":0,"secondaryThrottle":false},"collectionEpoch":{"$oid":"648a396998675a1d5dee0f26"}}
...
[j0:s1:prim] | 2023-06-14T22:06:07.034+00:00 I  MIGRATE  6386802 [conn53] "Rejecting receive chunk due to conflicting donate chunk in progress","attr":{"namespace":"test.basic4","runningMigration":{"_shardsvrMoveRange":"test.basic4","toShard":"shard-rs0","min":{"_id":{"$minKey":1}},"max":{"_id":0},"waitForDelete":false,"epoch":{"$oid":"648a396898675a1d5dee0ead"},"fromShard":"shard-rs1","maxChunkSizeBytes":134217728,"forceJumbo":0,"secondaryThrottle":false}}
...
[j0:s0:prim] | 2023-06-14T22:06:07.036+00:00 I  MIGRATE  6386802 [conn59] "Rejecting receive chunk due to conflicting donate chunk in progress","attr":{"namespace":"test.basic3","runningMigration":{"_shardsvrMoveRange":"test.basic3","toShard":"shard-rs1","min":{"_id":{"$minKey":1}},"max":{"_id":0},"waitForDelete":false,"epoch":{"$oid":"648a396998675a1d5dee0f26"},"fromShard":"shard-rs0","maxChunkSizeBytes":134217728,"forceJumbo":0,"secondaryThrottle":false}}
