Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.3.0, 5.2.0-rc4
Affects Version/s: None
Component/s: None
Labels: None

Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v5.2
Sprint: Sharding EMEA 2021-12-27, Sharding EMEA 2022-01-10
Linked BF Score: 156
Confidence Status: None
Work Order: 3
CAR Domain/s: None

Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

A JS test may define a sequence of commands like the following:

var joinMoveChunk1 = moveChunkParallel(
    staticMongod, st.s0.host, {Key: 10}, null, 'TestDB.TestColl', st.shard2.shardName);
var joinMoveChunk2 = moveChunkParallel(
    staticMongod, st.s0.host, {Key: 30}, null, 'TestDB.TestColl', st.shard3.shardName);

waitForMigrateStep(st.shard2, migrateStepNames.rangeDeletionTaskScheduled);
waitForMigrateStep(st.shard3, migrateStepNames.rangeDeletionTaskScheduled);

unpauseMigrateAtStep(st.shard2, migrateStepNames.rangeDeletionTaskScheduled);
unpauseMigrateAtStep(st.shard3, migrateStepNames.rangeDeletionTaskScheduled);

joinMoveChunk1();
joinMoveChunk2();

Such a test may reach a deadlock when a step-down occurs after the shard command for moveChunk1 has been issued but before the one for moveChunk2 has been sent, because:

1) On step-up, moveChunk1 is regenerated as part of the balancer's recovery procedure, which completes only once that command completes.
2) moveChunk2 is also scheduled for dispatch during the step-up, but the command is not actually sent until the recovery is over.
3) The recovery, however, never ends, since moveChunk1 is blocked by an active failpoint (which is only disabled once moveChunk2 is over).
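
The circular wait described above can be made explicit with a small model. The following is only an illustrative sketch in plain JavaScript, not code from the server or the test harness: the node names (balancerRecovery, failpointDisabled) and the findCycle helper are hypothetical labels introduced here, and the edges simply restate the three points above.

```javascript
// Hypothetical "waits for" model of the scenario (names are illustrative only).
// Each key cannot make progress until every entry in its list has completed.
const waitsFor = {
    balancerRecovery: ["moveChunk1"],   // recovery ends only once the regenerated moveChunk1 completes
    moveChunk1: ["failpointDisabled"],  // moveChunk1 is paused by an active failpoint
    failpointDisabled: ["moveChunk2"],  // the failpoint is disabled only once moveChunk2 is over
    moveChunk2: ["balancerRecovery"],   // moveChunk2 is not sent until the recovery is over
};

// Depth-first search that returns the first dependency cycle found, or null.
function findCycle(graph) {
    const state = {};  // undefined = unvisited, 1 = on the current path, 2 = fully explored
    const path = [];

    function visit(node) {
        if (state[node] === 2) {
            return null;
        }
        if (state[node] === 1) {
            // Reached a node that is already on the current path: that is a cycle.
            return path.slice(path.indexOf(node)).concat(node);
        }
        state[node] = 1;
        path.push(node);
        for (const next of graph[node] || []) {
            const cycle = visit(next);
            if (cycle) {
                return cycle;
            }
        }
        path.pop();
        state[node] = 2;
        return null;
    }

    for (const node of Object.keys(graph)) {
        const cycle = visit(node);
        if (cycle) {
            return cycle;
        }
    }
    return null;
}

const cycle = findCycle(waitsFor);
console.log(cycle ? cycle.join(" -> ") : "no deadlock");
// prints: balancerRecovery -> moveChunk1 -> failpointDisabled -> moveChunk2 -> balancerRecovery
```

Removing any single edge from the model (for example, if moveChunk2 did not have to wait for the recovery) makes findCycle return null, which is why the deadlock only shows up when a step-down lands between the two commands.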

The behaviour described in 1) and 2) matches the designed functionality of the BalancerCommandsScheduler (and the legacy MigrationManager). The proposal is therefore to avoid running the affected tests in stepdown suites.

is related to

SERVER-59891 Replace the coverage from sharding_continuous_config_stepdown.yml and then delete the test suite

Backlog

Assignee: Paolo Polato
Reporter: Paolo Polato
Participants: Githook User, Paolo Polato
Votes: 0
Watchers: 2

Created: Dec 19 2021 10:14:36 PM UTC
Updated: Oct 29 2023 09:44:51 PM UTC
Resolved: Jan 04 2022 08:23:26 AM UTC
Confidence Status Last Update: 20/Dec/21 8:49 AM
