  Core Server / SERVER-70487

Fix the test delete_range_deletion_tasks_on_stepup_after_drop_collection.js for catalog shard

    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: None
    • Component/s: None
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Sprint: Sharding 2022-10-17, Sharding NYC 2022-10-31
    • 3

      The test runs moveChunk and stepDown with failpoints; when the moveChunk fails (the failure itself is not asserted), it verifies that the documents for the migration coordination and the range deletion are still there.
      The problem is that in catalog shard (CS) mode the moveChunk actually succeeds and the documents are deleted.
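      For context, a minimal sketch (not the actual test; the two-shard ShardingTest, the namespace test.coll, and treating shard0 as the donor are assumptions) of the pattern the test relies on:

          const st = new ShardingTest({shards: 2});
          const ns = "test.coll";
          assert.commandWorked(st.s.adminCommand({enableSharding: "test"}));
          assert.commandWorked(st.s.adminCommand({shardCollection: ns, key: {_id: 1}}));

          // Count the on-disk state the test cares about on the donor (assumed shard0).
          function donorMigrationDocs() {
              return {
                  coordinators: st.shard0.getDB("config").migrationCoordinators.countDocuments({nss: ns}),
                  rangeDeletions: st.shard0.getDB("config").rangeDeletions.countDocuments({nss: ns}),
              };
          }

          // The real test drives the moveChunk, the failpoints and replSetStepDown from
          // parallel shells, and then asserts both counts are still 1, e.g.:
          //   assert.eq(1, donorMigrationDocs().coordinators);
          //   assert.eq(1, donorMigrationDocs().rangeDeletions);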

      The expected scenario without CS mode:

      1. Mongos proxies moveChunk command to config server as _configsvrMoveRange
      2. Config starts _configsvrMoveRange command
      3. Balancer moveRange() is invoked
      4. Donor shard receives the _shardsvrMoveRange command (recipient doesn't matter in this test)
      5. replSetStepDown executed at donor
      6. _shardsvrMoveRange fails at donor with InterruptedDueToReplStateChange
         (the asyncRecoverMigrationUntilSuccessOrStepDown failpoint is reached)
      7. _configsvrMoveRange fails at config with InterruptedDueToReplStateChange
      8. resumeMigrationCoordinationsOnStepUp failpoint reached
      9. InterruptedDueToReplStateChange is received by mongos
      10. moveChunk failed at mongos
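      In other words, in the non-CS scenario the step-down surfaces at mongos as a failed moveChunk. A hypothetical assertion of that outcome (the test itself does not assert the error; names reuse the sketch above, with shard0 as donor and shard1 as recipient) would be:

          assert.commandFailedWithCode(
              st.s.adminCommand({moveChunk: ns, find: {_id: 0}, to: st.shard1.shardName}),
              ErrorCodes.InterruptedDueToReplStateChange);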

      In catalog shard mode the moveChunk succeeds:
      1. CS received _configsvrMoveRange
      2. "Enqueuing new Balancer command request" _shardsvrMoveRange
      3. CS (same server) started executing _shardsvrMoveRange
      4. CS received replSetStepDown
         ShardsvrMoveRangeCommand fails with InterruptedDueToReplStateChange
         "Error processing the remote request" is logged for _shardsvrMoveRange
         "HostUnreachable: Connection closed by peer" is logged in SessionWorkflow
      5. _configsvrMoveRange does not fail
         Mongos logs "Command failed with retryable error and will be retried" for _configsvrMoveRange (InterruptedDueToReplStateChange)
      6. "MigrationCoordinator setting migration decision" aborted
      7. "Election succeeded, assuming primary role"
      8. "Enqueuing new Balancer command request" _shardsvrMoveRange is repeated again

      10 min later

      10. Timeout
      11. the onShardVersionMismatch() callback triggers reloading of the unfinished migration state doc -> recoverRefreshShardVersion()
          (this happens 10 minutes after the Balancer is scheduled)
      12. commitChunkMetadataOnConfig() deletes the migration document
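      Once that recovery runs, the document the test expects to still see is gone. A hedged way to observe it (same assumed names; the timeout only needs to exceed the ~10 minute stall):

          // Wait for commitChunkMetadataOnConfig() to remove the coordinator document
          // on the donor; the timeout is generous to cover the ~10 minute stall.
          assert.soon(
              () => st.shard0.getDB("config").migrationCoordinators.countDocuments({nss: ns}) === 0,
              "migration coordinator document was not cleaned up",
              15 * 60 * 1000);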

      More details on the 10-minute wait:
      It is essential that the migration command was not aborted.
      1. On catalog shard step-up, the balancer's initiateBalancer() is invoked
      2. On the config part of the catalog shard, BalancerCommandsSchedulerImpl::start() spends this time after step-up waiting in waitForQuiescedCluster().
      3. waitForQuiescedCluster() sends ShardsvrJoinMigrations to all shards
      4. ShardsvrJoinMigrationsCommand is invoked on both donor and recipient shards
      5. On the donor shard, the ShardsvrJoinMigrationsCommand completes promptly
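      While waitForQuiescedCluster() is still outstanding, the join command can be seen in flight on the shard primaries. A hedged illustration (it assumes the wire name _shardsvrJoinMigrations for ShardsvrJoinMigrationsCommand and reuses st from the sketch above):

          // List in-progress operations on the recipient primary and look for the
          // join-migrations command fanned out by waitForQuiescedCluster().
          const joinOps = st.rs1.getPrimary().getDB("admin")
                              .aggregate([
                                  {$currentOp: {allUsers: true}},
                                  {$match: {"command._shardsvrJoinMigrations": {$exists: true}}}
                              ])
                              .toArray();
          printjson(joinOps);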

      Where shards are blocked:
      1. The donor shard is blocked: BalancerCommandsSchedulerImpl::start() after step-up is waiting in waitForQuiescedCluster()
      2. At the same time, on the donor side, getToken()->refreshOplogTruncateAfterPointIfPrimary() is holding the global lock for 10 minutes

      1. On the recipient shard, ShardsvrJoinMigrationsCommand blocks on activeMigrationRegistry.lock()
      2. Receiver is stuck at MigrationDestinationManager::awaitCriticalSectionReleaseSignalAndCompleteMigration(), with:
      _canReleaseCriticalSectionPromise->getFuture().get(opCtx);
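      A hedged way to confirm the donor-side stall from the shell is to filter currentOp for lock waiters (standard currentOp fields; st.rs0 is assumed to be the donor's replica set):

          // Donor side: operations queued behind the global lock held by
          // refreshOplogTruncateAfterPointIfPrimary() show up with waitingForLock: true.
          const blockedOps = st.rs0.getPrimary().getDB("admin")
                                 .aggregate([
                                     {$currentOp: {allUsers: true, idleConnections: true}},
                                     {$match: {waitingForLock: true}}
                                 ])
                                 .toArray();
          printjson(blockedOps);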

      When the wait is unblocked:
      1. On the donor shard:
         an aggregate command is triggered by PeriodicShardedIndexConsistencyChecker
         onShardVersionMismatch() -> recoverRefreshShardVersion() -> recoverMigrationCoordinations()
         _recvChunkReleaseCritSec is sent to the recipient
         clearReceiveChunk is processed by the recipient
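      In other words, the unblocking only needs some versioned request that makes the donor notice its stale filtering metadata; a manual aggregate through mongos (a sketch of the same kind of trigger the PeriodicShardedIndexConsistencyChecker provides on its own schedule) might force it earlier:

          // Any sharded aggregate against the collection can reach the donor with a
          // stale shard version and kick off onShardVersionMismatch() -> recovery.
          st.s.getDB("test").coll.aggregate([{$indexStats: {}}]).toArray();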

            Assignee: Andrew Shuvalov (andrew.shuvalov@mongodb.com) (Inactive)
            Reporter: Andrew Shuvalov (andrew.shuvalov@mongodb.com) (Inactive)
            Votes: 0
            Watchers: 3
