A JS test defining a sequence of commands like
var joinMoveChunk1 = moveChunkParallel(
    staticMongod, st.s0.host, {Key: 10}, null, 'TestDB.TestColl', st.shard2.shardName);
var joinMoveChunk2 = moveChunkParallel(
    staticMongod, st.s0.host, {Key: 30}, null, 'TestDB.TestColl', st.shard3.shardName);

waitForMigrateStep(st.shard2, migrateStepNames.rangeDeletionTaskScheduled);
waitForMigrateStep(st.shard3, migrateStepNames.rangeDeletionTaskScheduled);

unpauseMigrateAtStep(st.shard2, migrateStepNames.rangeDeletionTaskScheduled);
unpauseMigrateAtStep(st.shard3, migrateStepNames.rangeDeletionTaskScheduled);

joinMoveChunk1();
joinMoveChunk2();
may reach a deadlock when a step-down occurs after the shard command for moveChunk1 has been issued, but before the one for moveChunk2 is sent out. The deadlock arises because:
1) On step-up, moveChunk1 is regenerated as part of the balancer's recovery procedure, which only completes once that command also completes.
2) moveChunk2 is also queued for dispatching during the step-up, but the command is not actually sent until the recovery is over.
3) Nevertheless, the recovery never ends, since moveChunk1 is blocked by an active failpoint (which is only disabled once moveChunk2 is over).
The behaviour described in 1) and 2) matches the designed functionality of the BalancerCommandsScheduler (and the legacy MigrationManager). The proposal is then to solve the problem by avoiding the execution of the affected tests in stepdown suites.
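The circular wait behind this deadlock can be sketched as a small wait-for graph. This is only an illustrative model, not server or test code: the node names (balancerRecovery, failpointDisabled) and the findCycle helper are hypothetical, and the edges simplify the dependencies described in this ticket.

```javascript
// Hypothetical wait-for graph: each key is blocked until every listed item makes progress.
// The edges are a simplified reading of the scenario described in this ticket.
const waitsFor = {
    balancerRecovery: ["moveChunk1"],    // recovery completes only once moveChunk1 completes
    moveChunk1: ["failpointDisabled"],   // moveChunk1 is paused by the active failpoint
    failpointDisabled: ["moveChunk2"],   // the test unpauses only after moveChunk2 progresses
    moveChunk2: ["balancerRecovery"],    // moveChunk2 is not sent until recovery is over
};

// Depth-first search that returns the first cycle found as a list of nodes, or null.
function findCycle(graph) {
    const state = {};  // undefined = unvisited, 1 = on the current path, 2 = fully explored
    const path = [];

    function visit(node) {
        if (state[node] === 2) {
            return null;
        }
        if (state[node] === 1) {
            // The node is already on the current path: close the cycle.
            return path.slice(path.indexOf(node)).concat(node);
        }
        state[node] = 1;
        path.push(node);
        for (const next of graph[node] || []) {
            const cycle = visit(next);
            if (cycle) {
                return cycle;
            }
        }
        path.pop();
        state[node] = 2;
        return null;
    }

    for (const node of Object.keys(graph)) {
        const cycle = visit(node);
        if (cycle) {
            return cycle;
        }
    }
    return null;
}

const cycle = findCycle(waitsFor);
```

In this model every participant waits on the next one, so findCycle reports a cycle that starts and ends at balancerRecovery. Removing any single edge (for example, by not running the test where a step-down can trigger the recovery) breaks the cycle.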
- is related to SERVER-59891 Replace the coverage from sharding_continuous_config_stepdown.yml and then delete the test suite (Backlog)