- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: 8.1.0-rc4
- Component/s: None
- Cluster Scalability
- ClusterScalability Apr28-May09
If featureFlagReshardingCloneNoRefresh is enabled, the coordinator uses ShardsvrReshardRecipientClone to tell the recipients to begin cloning.
The coordinator blocks until this command returns. The command calls fulfillAllDonorsPreparedToDonate, which fulfills a promise and then waits for a future to complete. That future uses a cancellation token, but the token's source is the opCtx running the command, which has no knowledge of resharding or its progress. If the recipient hits an error before fulfilling _transitionedToCreateCollection, the command therefore hangs indefinitely, and the coordinator hangs with it.
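The hang described above can be illustrated with a toy model. This is only a sketch in Python, not server code: `RecipientModel`, `clone_command`, `simulate`, and the `observe_for` window are hypothetical names introduced for illustration, and the window exists only so the demo terminates (the real wait has no deadline).

```python
import threading
import time


class RecipientModel:
    """Toy stand-in for a resharding recipient (hypothetical, not server code)."""

    def __init__(self):
        self.all_donors_prepared = threading.Event()
        self.transitioned_to_create_collection = threading.Event()
        self.error = None

    def run(self, fail_early):
        # The recipient proceeds only once the command has fulfilled the promise.
        self.all_donors_prepared.wait()
        if fail_early:
            # Non-retryable error before the transition is ever signalled.
            self.error = RuntimeError("wait for majority failed")
            return
        self.transitioned_to_create_collection.set()


def clone_command(recipient, op_ctx_cancelled, observe_for=0.2):
    """Model of the command: fulfill the promise, then wait on the future."""
    recipient.all_donors_prepared.set()
    deadline = time.monotonic() + observe_for
    while time.monotonic() < deadline:
        if recipient.transitioned_to_create_collection.wait(timeout=0.01):
            return "ok"
        # The only cancellation the wait observes is the opCtx's own.
        if op_ctx_cancelled.is_set():
            return "cancelled"
    # The real wait has no such deadline; it would simply keep blocking.
    return "still waiting"


def simulate(fail_early):
    recipient = RecipientModel()
    worker = threading.Thread(target=recipient.run, args=(fail_early,))
    worker.start()
    result = clone_command(recipient, op_ctx_cancelled=threading.Event())
    worker.join()
    return result
```

In this model, `simulate(False)` completes normally, while `simulate(True)` is still waiting when the observation window closes, because `clone_command` never looks at `recipient.error`.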
This can happen if any non-retryable error occurs in the recipient within this region. Until SERVER-102452 is done, an obvious way to trigger it is for the wait for majority to fail here.
Resolving this is not as simple as using the abortSource as the cancellation token. The abortSource is only cancelled when the coordinator updates itself to aborting or calls ShardsvrAbortReshardCollectionCommand, which it won't do because it is blocked on ShardsvrReshardRecipientClone.
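The circular dependency can be sketched the same way. The Python model below is an assumption-laden illustration, not the server's implementation: `coordinator_attempt`, `clone_command`, and the `observe_for` window are hypothetical, and the window only bounds the demo. It assumes a variant of the command that does wait on the abort source, and shows that nothing cancels that source while the coordinator is blocked.

```python
import threading
import time


def coordinator_attempt(observe_for=0.2):
    """Model a coordinator whose abort path sits behind a blocking command."""
    abort_source = threading.Event()   # cancelled only by the coordinator itself
    transitioned = threading.Event()   # never set: the recipient failed early
    log = []

    def clone_command():
        # Hypothetical variant that uses abort_source as its cancellation token.
        deadline = time.monotonic() + observe_for
        while time.monotonic() < deadline:
            if transitioned.is_set() or abort_source.is_set():
                return True
            time.sleep(0.01)
        # The real wait has no deadline; this return only ends the demo.
        return False

    log.append("send clone command")
    finished = clone_command()         # the coordinator blocks on this call
    log.append("command finished" if finished else "command still blocked")
    # Only code after the blocking call could move the coordinator to aborting
    # and cancel abort_source, so the wait above can never observe it.
    return log, abort_source.is_set()
```

Calling `coordinator_attempt()` in this model ends with the command still blocked and the abort source never cancelled, which mirrors why swapping the token alone does not break the hang.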
- is caused by
  - SERVER-97464 Implement resharding recipient cloning transition using commands. (Closed)
- is related to
  - SERVER-104265 Disable feature flag gFeatureFlagReshardingCloneNoRefresh (In Progress)
  - SERVER-104317 All resharding services should retry on WCEs (Open)
  - SERVER-102452 Make ReshardingDonorService retry on WriteConcernFailure when finishing (In Code Review)
- related to
  - SERVER-104494 Resharding can hang if recipient steps down during ShardsvrReshardRecipientClone (Needs Scheduling)
  - SERVER-102769 Adapt featureFlagReshardingCloneNoRefresh to use OFCV-aware checks (Backlog)