[SERVER-62181] JStests including multiple parallel migrations with failpoints shouldn't be run in the config server stepdown suites Created: 19/Dec/21  Updated: 29/Oct/23  Resolved: 04/Jan/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.3.0, 5.2.0-rc4

Type: Bug Priority: Major - P3
Reporter: Paolo Polato Assignee: Paolo Polato
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
is related to SERVER-59891 Replace the coverage from sharding_co... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.2
Sprint: Sharding EMEA 2021-12-27, Sharding EMEA 2022-01-10
Participants:
Linked BF Score: 156

 Description   

A JS test defining a sequence of commands like

// Start two migrations in parallel, each targeting a different recipient shard.
var joinMoveChunk1 = moveChunkParallel(
    staticMongod, st.s0.host, {Key: 10}, null, 'TestDB.TestColl', st.shard2.shardName);
var joinMoveChunk2 = moveChunkParallel(
    staticMongod, st.s0.host, {Key: 30}, null, 'TestDB.TestColl', st.shard3.shardName);

// Wait until both migrations are paused at the same failpoint step.
waitForMigrateStep(st.shard2, migrateStepNames.rangeDeletionTaskScheduled);
waitForMigrateStep(st.shard3, migrateStepNames.rangeDeletionTaskScheduled);

// Release both failpoints only after both migrations have reached the step.
unpauseMigrateAtStep(st.shard2, migrateStepNames.rangeDeletionTaskScheduled);
unpauseMigrateAtStep(st.shard3, migrateStepNames.rangeDeletionTaskScheduled);

// Join the parallel shells that issued the moveChunk commands.
joinMoveChunk1();
joinMoveChunk2();

may reach a deadlock if a step-down event occurs after the shard command for moveChunk1 has been issued, but before the one for moveChunk2 has been sent. The deadlock arises because:

  1. on step-up, moveChunk1 will be regenerated as part of the balancer's recovery procedure, which only completes once that command also completes
  2. moveChunk2 will also be scheduled for dispatch during the step-up, but the command won't actually be sent until the recovery is over
  3. nevertheless, the recovery will never end, since moveChunk1 is blocked by an active failpoint (which will only be disabled once moveChunk2 is over)

The behaviour described in 1) and 2) is by design in the BalancerCommandsScheduler (and the legacy MigrationManager). The proposal is therefore to solve the problem by excluding the affected tests from the stepdown suites.
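
The circular wait in points 1) to 3) can be illustrated with a small wait-for graph. This is only a sketch: the node names (balancerRecovery, moveChunk1, moveChunk2, failpoint) and the findCycle helper are hypothetical labels chosen for illustration, not server or test-library identifiers. An edge A -> B means "A cannot make progress until B does".

```javascript
// Illustrative wait-for graph for the scenario described above.
// All names here are hypothetical labels, not MongoDB identifiers.
const waitsFor = {
    balancerRecovery: ["moveChunk1"],  // 1) recovery completes only when moveChunk1 completes
    moveChunk2: ["balancerRecovery"],  // 2) moveChunk2 is not sent until recovery is over
    moveChunk1: ["failpoint"],         // 3) moveChunk1 is blocked by the active failpoint
    failpoint: ["moveChunk2"],         // 3) the failpoint is released only after moveChunk2
};

// Depth-first search that returns the first cycle found as a list of nodes
// (with the starting node repeated at the end), or null if the graph is acyclic.
function findCycle(graph) {
    const state = {};  // undefined = unvisited, 1 = on the current path, 2 = fully explored
    const path = [];

    function visit(node) {
        if (state[node] === 2) {
            return null;
        }
        if (state[node] === 1) {
            return path.slice(path.indexOf(node)).concat(node);
        }
        state[node] = 1;
        path.push(node);
        for (const next of graph[node] || []) {
            const cycle = visit(next);
            if (cycle) {
                return cycle;
            }
        }
        path.pop();
        state[node] = 2;
        return null;
    }

    for (const node of Object.keys(graph)) {
        const cycle = visit(node);
        if (cycle) {
            return cycle;
        }
    }
    return null;
}

const cycle = findCycle(waitsFor);
console.log(cycle ? "deadlock: " + cycle.join(" -> ") : "no deadlock");
```

In this model, every node takes part in the cycle, so no participant can make progress on its own. Removing the step-down (and therefore the balancerRecovery dependency) breaks the cycle, which is consistent with the proposal to keep these tests out of the stepdown suites.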



 Comments   
Comment by Githook User [ 04/Jan/22 ]

Author:

{'name': 'Paolo Polato', 'email': 'paolo.polato@mongodb.com', 'username': 'ppolato'}

Message: SERVER-62181 Do not run tests with multiple parallel migrations in stepdown suites

(cherry picked from commit 68b1c9ad4fe1c4015c587968bc23b7ec45c5dd17)
Branch: v5.2
https://github.com/mongodb/mongo/commit/53b4b21bcf83ffc14156478293c90bfeab3b10a5

Comment by Githook User [ 21/Dec/21 ]

Author:

{'name': 'Paolo Polato', 'email': 'paolo.polato@mongodb.com', 'username': 'ppolato'}

Message: SERVER-62181 Do not run tests with multiple parallel migrations in stepdown suites
Branch: master
https://github.com/mongodb/mongo/commit/68b1c9ad4fe1c4015c587968bc23b7ec45c5dd17
