  Core Server / SERVER-70487

Fix the test delete_range_deletion_tasks_on_stepup_after_drop_collection.js for catalog shard

    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: None
    • Component/s: None
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Sprint: Sharding 2022-10-17, Sharding NYC 2022-10-31
    • 3

      The test runs moveChunk and stepDown with failpoints; when the moveChunk fails (the failure itself is not asserted), it verifies that the documents for the migration coordination and the range deletion are still there.
      The problem is that in catalog shard (CS) mode the moveChunk actually succeeds and the documents are deleted.
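      For context, a minimal sketch (not the actual test; the two-shard ShardingTest, the namespace test.coll, and treating shard0 as the donor are assumptions) of the pattern the test relies on:

          const st = new ShardingTest({shards: 2});
          const ns = "test.coll";
          assert.commandWorked(st.s.adminCommand({enableSharding: "test"}));
          assert.commandWorked(st.s.adminCommand({shardCollection: ns, key: {_id: 1}}));

          // Count the on-disk state the test cares about on the donor (assumed shard0).
          function donorMigrationDocs() {
              return {
                  coordinators: st.shard0.getDB("config").migrationCoordinators.countDocuments({nss: ns}),
                  rangeDeletions: st.shard0.getDB("config").rangeDeletions.countDocuments({nss: ns}),
              };
          }

          // The real test drives the moveChunk, the failpoints and replSetStepDown from
          // parallel shells, and then asserts both counts are still 1, e.g.:
          //   assert.eq(1, donorMigrationDocs().coordinators);
          //   assert.eq(1, donorMigrationDocs().rangeDeletions);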

      The expected scenario without CS mode:

      1. Mongos proxies moveChunk command to config server as _configsvrMoveRange
      2. Config starts _configsvrMoveRange command
      3. Balancer moveRange() is invoked
      4. Donor shard receives the _shardsvrMoveRange command (recipient doesn't matter in this test)
      5. replSetStepDown executed at donor
      6. _shardsvrMoveRange fails at donor with InterruptedDueToReplStateChange
         (the asyncRecoverMigrationUntilSuccessOrStepDown failpoint is reached)
      7. _configsvrMoveRange fails at config with InterruptedDueToReplStateChange
      8. resumeMigrationCoordinationsOnStepUp failpoint reached
      9. InterruptedDueToReplStateChange is received by mongos
      10. moveChunk failed at mongos
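      In other words, in the non-CS scenario the step-down surfaces at mongos as a failed moveChunk. A hypothetical assertion of that outcome (the test itself does not assert the error; names reuse the sketch above, with shard0 as donor and shard1 as recipient) would be:

          assert.commandFailedWithCode(
              st.s.adminCommand({moveChunk: ns, find: {_id: 0}, to: st.shard1.shardName}),
              ErrorCodes.InterruptedDueToReplStateChange);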

      In catalog shard mode the moveChunk succeeds:
      1. CS received _configsvrMoveRange
      2. "Enqueuing new Balancer command request" _shardsvrMoveRange
      3. CS (same server) started executing _shardsvrMoveRange
      4. CS received replSetStepDown
         ShardsvrMoveRangeCommand fails with InterruptedDueToReplStateChange
         "Error processing the remote request" is logged for _shardsvrMoveRange
         "HostUnreachable: Connection closed by peer" is logged in SessionWorkflow
      5. _configsvrMoveRange does not fail
         Mongos logs "Command failed with retryable error and will be retried" for _configsvrMoveRange (InterruptedDueToReplStateChange)
      6. "MigrationCoordinator setting migration decision" aborted
      7. "Election succeeded, assuming primary role"
      8. "Enqueuing new Balancer command request" _shardsvrMoveRange is repeated again

      10 min later

      10. Timeout
      11. the onShardVersionMismatch() callback triggers reloading of the unfinished migration state doc -> recoverRefreshShardVersion()
          (this happens 10 minutes after the Balancer is scheduled)
      12. commitChunkMetadataOnConfig() deletes the migration document
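      Once that recovery runs, the document the test expects to still see is gone. A hedged way to observe it (same assumed names; the timeout only needs to exceed the ~10 minute stall):

          // Wait for commitChunkMetadataOnConfig() to remove the coordinator document
          // on the donor; the timeout is generous to cover the ~10 minute stall.
          assert.soon(
              () => st.shard0.getDB("config").migrationCoordinators.countDocuments({nss: ns}) === 0,
              "migration coordinator document was not cleaned up",
              15 * 60 * 1000);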

      More details on the 10-minute wait:
      It is essential that the migration command was not aborted.
      1. On catalog shard step-up, the balancer's initiateBalancer() is invoked
      2. On the config part of the catalog shard, BalancerCommandsSchedulerImpl::start() spends this time after step-up waiting in waitForQuiescedCluster().
      3. waitForQuiescedCluster() sends ShardsvrJoinMigrations to all shards
      4. ShardsvrJoinMigrationsCommand is invoked on both donor and recipient shards
      5. On the donor shard, the ShardsvrJoinMigrationsCommand completes promptly
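      While waitForQuiescedCluster() is still outstanding, the join command can be seen in flight on the shard primaries. A hedged illustration (it assumes the wire name _shardsvrJoinMigrations for ShardsvrJoinMigrationsCommand and reuses st from the sketch above):

          // List in-progress operations on the recipient primary and look for the
          // join-migrations command fanned out by waitForQuiescedCluster().
          const joinOps = st.rs1.getPrimary().getDB("admin")
                              .aggregate([
                                  {$currentOp: {allUsers: true}},
                                  {$match: {"command._shardsvrJoinMigrations": {$exists: true}}}
                              ])
                              .toArray();
          printjson(joinOps);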

      Where shards are blocked:
      1. The donor shard is blocked: BalancerCommandsSchedulerImpl::start() after step-up is waiting in waitForQuiescedCluster()
      2. At the same time, on the donor side, getToken()->refreshOplogTruncateAfterPointIfPrimary() is holding the global lock for 10 minutes

      1. On the recipient shard, ShardsvrJoinMigrationsCommand blocks on activeMigrationRegistry.lock()
      2. Receiver is stuck at MigrationDestinationManager::awaitCriticalSectionReleaseSignalAndCompleteMigration(), with:
      _canReleaseCriticalSectionPromise->getFuture().get(opCtx);
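      A hedged way to confirm the donor-side stall from the shell is to filter currentOp for lock waiters (standard currentOp fields; st.rs0 is assumed to be the donor's replica set):

          // Donor side: operations queued behind the global lock held by
          // refreshOplogTruncateAfterPointIfPrimary() show up with waitingForLock: true.
          const blockedOps = st.rs0.getPrimary().getDB("admin")
                                 .aggregate([
                                     {$currentOp: {allUsers: true, idleConnections: true}},
                                     {$match: {waitingForLock: true}}
                                 ])
                                 .toArray();
          printjson(blockedOps);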

      When the wait is unblocked:
      1. On the donor shard:
         an aggregate command is triggered by PeriodicShardedIndexConsistencyChecker
         onShardVersionMismatch() -> recoverRefreshShardVersion() -> recoverMigrationCoordinations()
         _recvChunkReleaseCritSec is sent to the recipient
         clearReceiveChunk is processed by the recipient
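      In other words, the unblocking only needs some versioned request that makes the donor notice its stale filtering metadata; a manual aggregate through mongos (a sketch of the same kind of trigger the PeriodicShardedIndexConsistencyChecker provides on its own schedule) might force it earlier:

          // Any sharded aggregate against the collection can reach the donor with a
          // stale shard version and kick off onShardVersionMismatch() -> recovery.
          st.s.getDB("test").coll.aggregate([{$indexStats: {}}]).toArray();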

            Assignee: Andrew Shuvalov (andrew.shuvalov@mongodb.com) (Inactive)
            Reporter: Andrew Shuvalov (andrew.shuvalov@mongodb.com) (Inactive)
            Votes: 0
            Watchers: 3
