Core Server / SERVER-62181

JStests including multiple parallel migrations with failpoints shouldn't be run in the config server stepdown suites

Details

    • Bug
    • Status: Closed
    • Major - P3
    • Resolution: Fixed
    • None
    • 5.3.0, 5.2.0-rc4
    • None
    • None
    • Fully Compatible
    • ALL
    • v5.2
    • Sharding EMEA 2021-12-27, Sharding EMEA 2022-01-10
    • 156

    Description

      A JS test defining a sequence of commands like

      // Start two migrations in parallel shells; each call returns a join function.
      var joinMoveChunk1 = moveChunkParallel(
          staticMongod, st.s0.host, {Key: 10}, null, 'TestDB.TestColl', st.shard2.shardName);
      var joinMoveChunk2 = moveChunkParallel(
          staticMongod, st.s0.host, {Key: 30}, null, 'TestDB.TestColl', st.shard3.shardName);

      // Wait for each recipient shard to reach the failpoint at rangeDeletionTaskScheduled.
      waitForMigrateStep(st.shard2, migrateStepNames.rangeDeletionTaskScheduled);
      waitForMigrateStep(st.shard3, migrateStepNames.rangeDeletionTaskScheduled);

      // Release both failpoints, then wait for both migrations to complete.
      unpauseMigrateAtStep(st.shard2, migrateStepNames.rangeDeletionTaskScheduled);
      unpauseMigrateAtStep(st.shard3, migrateStepNames.rangeDeletionTaskScheduled);

      joinMoveChunk1();
      joinMoveChunk2();
      

      may deadlock if a step-down occurs after the shard command for moveChunk1 has been issued but before the one for moveChunk2 is sent. The cause is the following:

      1. On step-up, moveChunk1 is regenerated as part of the balancer's recovery procedure, which completes only once that command also completes.
      2. moveChunk2 is also scheduled for dispatch during step-up, but the command is not actually sent until the recovery is over.
      3. The recovery, however, never ends, because moveChunk1 is blocked by an active failpoint (which is disabled only once moveChunk2 is over).
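
      The three points above form a circular wait. The following is a minimal sketch that models them as a wait-for graph and searches it for a cycle. It is an illustration only, not server or test code: the node names (balancerRecovery, moveChunk2Dispatch, failpointRelease) and the findCycle helper are hypothetical and do not exist in the codebase.

```javascript
// Hypothetical wait-for graph: each key cannot make progress until the
// items listed for it have made progress.
const waitsFor = {
  balancerRecovery: ["moveChunk1"],          // point 1: recovery ends only when moveChunk1 does
  moveChunk2Dispatch: ["balancerRecovery"],  // point 2: moveChunk2 is sent only after recovery
  moveChunk1: ["failpointRelease"],          // point 3: moveChunk1 is blocked by the failpoint
  failpointRelease: ["moveChunk2Dispatch"],  // point 3: the failpoint is lifted only after moveChunk2 progresses
};

// Depth-first search that returns the first cycle found, or null.
function findCycle(graph) {
  const state = {};  // undefined = unvisited, 1 = on the current path, 2 = fully explored
  const path = [];

  function visit(node) {
    if (state[node] === 1) {
      // Reached a node already on the current path: report the loop.
      return path.slice(path.indexOf(node)).concat(node);
    }
    if (state[node] === 2) {
      return null;
    }
    state[node] = 1;
    path.push(node);
    for (const next of graph[node] || []) {
      const cycle = visit(next);
      if (cycle) {
        return cycle;
      }
    }
    path.pop();
    state[node] = 2;
    return null;
  }

  for (const node of Object.keys(graph)) {
    const cycle = visit(node);
    if (cycle) {
      return cycle;
    }
  }
  return null;
}

const deadlock = findCycle(waitsFor);
console.log(deadlock ? deadlock.join(" -> ") : "no cycle");
// prints "balancerRecovery -> moveChunk1 -> failpointRelease -> moveChunk2Dispatch -> balancerRecovery"
```

      Without a step-down, moveChunk2 is dispatched without waiting on any recovery; removing that edge from the model (moveChunk2Dispatch: []) leaves no cycle, which matches the test passing in suites that do not step down.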

      The behaviour described in 1) and 2) matches the intended design of the BalancerCommandsScheduler (and of the legacy MigrationManager). The proposal is therefore to solve the problem by excluding the affected tests from the stepdown suites.
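
      One way such an exclusion is commonly expressed in jstests is a tag in the test's header comment, which a suite can then filter on. The sketch below assumes that the stepdown suites exclude tests tagged does_not_support_stepdowns; the ticket does not say which mechanism the fix used, so both the tag choice and the header text are assumptions.

```javascript
/**
 * Hypothetical header for an affected test. The description and the tag
 * choice are assumptions, not taken from the actual fix.
 *
 * Runs two parallel migrations that are paused by failpoints, so a config
 * server step-down between the two moveChunk commands can deadlock the test.
 *
 * @tags: [
 *   does_not_support_stepdowns,
 * ]
 */
```

      An alternative with the same effect would be to list the affected files in the exclusion section of the suite's configuration; which of the two is preferable depends on how the suites are maintained.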


            People

              paolo.polato@mongodb.com Paolo Polato
              Votes: 0
              Watchers: 2
