[SERVER-70487] Fix the test delete_range_deletion_tasks_on_stepup_after_drop_collection.js for catalog shard Created: 11/Oct/22  Updated: 29/Oct/23  Resolved: 24/Oct/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Andrew Shuvalov (Inactive) Assignee: Andrew Shuvalov (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-70771 Invariant failure in ConnectionMetrics Closed
Backwards Compatibility: Fully Compatible
Sprint: Sharding 2022-10-17, Sharding NYC 2022-10-31
Participants:
Story Points: 3

 Description   

The test runs moveChunk together with replSetStepDown under failpoints; when the moveChunk fails (the failure itself is not asserted), it verifies that the migration coordination and range deletion documents are still present.
The problem is that in catalog shard (CS) mode the moveChunk actually succeeds and the documents are deleted.
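
For orientation, a minimal shell sketch of the pattern under test; the topology, the moveChunkHangAtStep4 failpoint, and the test.coll namespace are illustrative assumptions, not taken from the actual jstest:

load("jstests/libs/fail_point_util.js");

const st = new ShardingTest({shards: 2, rs: {nodes: 2}});
assert.commandWorked(st.s.adminCommand({enableSharding: "test"}));
st.ensurePrimaryShard("test", st.shard0.shardName);
assert.commandWorked(st.s.adminCommand({shardCollection: "test.coll", key: {_id: 1}}));

// Pause the donor mid-migration (the failpoint name here is an assumption).
let donorPrimary = st.rs0.getPrimary();
const hangMigration = configureFailPoint(donorPrimary, "moveChunkHangAtStep4");

// The move is expected to fail after the step-down, but the result is
// deliberately not asserted, mirroring the test's behavior.
const joinMove = startParallelShell(
    `db.adminCommand({moveChunk: "test.coll", find: {_id: 0}, to: "${st.shard1.shardName}"});`,
    st.s.port);

hangMigration.wait();
assert.commandWorked(donorPrimary.adminCommand({replSetStepDown: 60, force: true}));
hangMigration.off();
joinMove();

// The migration coordination and range deletion documents must still be there.
donorPrimary = st.rs0.getPrimary();  // possibly a new node after the step-down
assert.eq(1, donorPrimary.getDB("config").migrationCoordinators.count());
assert.gte(donorPrimary.getDB("config").rangeDeletions.count(), 1);

st.stop();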

The expected scenario without CS mode (a sketch of pinning the two recovery failpoints follows the list):

1. Mongos proxies the moveChunk command to the config server as _configsvrMoveRange
2. The config server starts the _configsvrMoveRange command
3. Balancer moveRange() is invoked
4. The donor shard receives the _shardsvrMoveRange command (the recipient does not matter in this test)
5. replSetStepDown is executed at the donor
6. _shardsvrMoveRange fails at the donor with InterruptedDueToReplStateChange
   - the asyncRecoverMigrationUntilSuccessOrStepDown failpoint is reached
7. _configsvrMoveRange fails at the config server with InterruptedDueToReplStateChange
8. The resumeMigrationCoordinationsOnStepUp failpoint is reached
9. InterruptedDueToReplStateChange is received by mongos
10. moveChunk fails at mongos
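
The two recovery failpoints in steps 6 and 8 can be pinned from the test with configureFailPoint. A sketch; which node hits which failpoint (the old donor primary vs. the node that steps up) is my reading of the list above, not something this ticket states explicitly:

const oldDonorPrimary = st.rs0.getPrimary();
const stepUpNode = st.rs0.getSecondary();
const asyncRecoverFp =
    configureFailPoint(oldDonorPrimary, "asyncRecoverMigrationUntilSuccessOrStepDown");
const resumeFp = configureFailPoint(stepUpNode, "resumeMigrationCoordinationsOnStepUp");

// ... run the moveChunk + replSetStepDown sequence from the earlier sketch ...

asyncRecoverFp.wait();  // step 6: the donor-side recovery failpoint is reached
resumeFp.wait();        // step 8: step-up recovery is reached on the new primary
asyncRecoverFp.off();
resumeFp.off();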

In catalog shard mode the moveChunk succeeds instead (the retry is observable in the logs; see the checkLog sketch after this list):

1. The CS receives _configsvrMoveRange
2. "Enqueuing new Balancer command request" is logged for _shardsvrMoveRange
3. The CS (the same server) starts executing _shardsvrMoveRange
4. The CS receives replSetStepDown
   - ShardsvrMoveRangeCommand fails with InterruptedDueToReplStateChange
   - "Error processing the remote request" is logged for _shardsvrMoveRange
   - "HostUnreachable: Connection closed by peer" in SessionWorkflow
5. _configsvrMoveRange does not fail
   - Mongos logs "Command failed with retryable error and will be retried" for _configsvrMoveRange with InterruptedDueToReplStateChange
6. "MigrationCoordinator setting migration decision" is logged with decision: aborted
7. "Election succeeded, assuming primary role"
8. "Enqueuing new Balancer command request" for _shardsvrMoveRange is repeated

About 10 minutes later:

9. Timeout
10. The onShardVersionMismatch() callback triggers the reload of the unfinished migration state document -> recoverRefreshShardVersion(); this happens 10 minutes after the Balancer command is scheduled
11. commitChunkMetadataOnConfig() deletes the migration document
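
The retry in step 5 and the repeated enqueue in step 8 are observable from the test with the standard checkLog helper; the message strings are quoted from the log lines above:

// Step 5: mongos retried _configsvrMoveRange on InterruptedDueToReplStateChange.
checkLog.contains(st.s, "Command failed with retryable error and will be retried");
// Step 8: the catalog shard re-enqueued the balancer command after the election.
checkLog.contains(st.configRS.getPrimary(), "Enqueuing new Balancer command request");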

More details on the 10-minute wait (a $currentOp sketch for observing the blocked join follows this list):
It is essential here that the migration command was not aborted.
1. On catalog shard step-up, the Balancer's initiateBalancer() is invoked
2. On the config part of the catalog shard, BalancerCommandsSchedulerImpl::start() spends this time after step-up waiting in waitForQuiescedCluster()
3. waitForQuiescedCluster() sends ShardsvrJoinMigrations to all shards
4. ShardsvrJoinMigrationsCommand is invoked on both the donor and the recipient shard
5. On the donor shard, ShardsvrJoinMigrationsCommand completes promptly
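
On the recipient the join does not complete (next list). A sketch of observing the parked command with a $currentOp aggregation; _shardsvrJoinMigrations is the usual internal spelling of ShardsvrJoinMigrationsCommand, and the filter shape is an assumption:

// Run against the recipient shard's primary while the balancer is blocked.
const stuckJoins = st.rs1.getPrimary().getDB("admin")
    .aggregate([
        {$currentOp: {allUsers: true, idleConnections: true}},
        {$match: {"command._shardsvrJoinMigrations": {$exists: true}}}
    ])
    .toArray();
printjson(stuckJoins);  // expect one op waiting on the active migration registry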

Where the shards are blocked (a lock-diagnostics sketch follows this list):

On the donor shard:
1. BalancerCommandsSchedulerImpl::start() is blocked after step-up, waiting in waitForQuiescedCluster()
2. At the same time, on the donor side, getToken()->refreshOplogTruncateAfterPointIfPrimary() is holding the global lock for 10 minutes

On the recipient shard:
1. ShardsvrJoinMigrationsCommand blocks on activeMigrationRegistry.lock()
2. The recipient is stuck in MigrationDestinationManager::awaitCriticalSectionReleaseSignalAndCompleteMigration(), at:
   _canReleaseCriticalSectionPromise->getFuture().get(opCtx);
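
The donor-side global lock hold (item 2 above) can be inspected with standard diagnostics; lockInfo and the currentOp filter are real shell facilities, but reading the 10-minute hold out of their output is my interpretation:

// In catalog shard mode the donor is the CS itself, so point at the config primary.
const donorPrim = st.configRS.getPrimary();
printjson(donorPrim.adminCommand({lockInfo: 1}));  // current global lock holders
printjson(donorPrim.getDB("admin").currentOp({waitingForLock: true}).inprog);  // ops queued behind it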

When the wait is unblocked (a sketch of forcing the refresh without the wait follows the list):
1. On the donor shard, an aggregate command triggered by the PeriodicShardedIndexConsistencyChecker leads to onShardVersionMismatch() -> recoverRefreshShardVersion() -> recoverMigrationCoordinations()
2. The donor sends _recvChunkReleaseCritSec to the recipient
3. clearReceiveChunk is processed by the recipient
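
A test that does not want to wait for the periodic checker's aggregate could try forcing the same refresh path directly. _flushRoutingTableCacheUpdates is a real internal admin command; whether it drives recoverMigrationCoordinations() identically to the periodic path is an assumption:

// Force a filtering-metadata refresh on the donor for the test namespace, so
// onShardVersionMismatch() -> recoverRefreshShardVersion() runs without the
// 10-minute wait (assumption, see above).
assert.commandWorked(st.configRS.getPrimary().adminCommand(
    {_flushRoutingTableCacheUpdates: "test.coll", syncFromConfig: true}));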

