Type: Bug
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 8.1.0-rc4
Component/s: None
Labels: resharding-improvements
Assigned Teams: Cluster Scalability
Sprint: ClusterScalability Apr28-May09, ClusterScalability May12-May25, ClusterScalability Jun9-Jun23, ClusterScalability Jul7-Jul20
Linked BF Score: 0
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

If featureFlagReshardingCloneNoRefresh is enabled, the coordinator uses ShardsvrReshardRecipientClone to tell the recipients to begin cloning.

When the coordinator sends this command, it blocks until the command returns. The command itself calls fulfillAllDonorsPreparedToDonate, which fulfills a promise and then waits for a future to complete. That future does use a cancellation token, but the token's source is the opCtx running the command, which has no knowledge of resharding or its progress. If the recipient hits an error before fulfilling _transitionedToCreateCollection, the command therefore hangs indefinitely, and the coordinator hangs with it.
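The hang described above can be modelled in a few lines. This is a minimal Python sketch of the shape of the problem, not the server code: `recipient_state_machine`, `recipient_clone_command`, and `run` are hypothetical names, and an `asyncio` future stands in for _transitionedToCreateCollection. It assumes the recipient exits on error without ever signalling the future, which is the behaviour this ticket describes.

```python
import asyncio


async def recipient_state_machine(transitioned: asyncio.Future, fail_early: bool) -> None:
    """Hypothetical stand-in for the recipient; fulfils `transitioned` only on success."""
    await asyncio.sleep(0.01)
    if fail_early:
        # Models a non-retryable error hit before the transition.
        # Nothing sets a result or an error on `transitioned`, so waiters never wake.
        return
    transitioned.set_result("create-collection")


async def recipient_clone_command(transitioned: asyncio.Future) -> str:
    """Hypothetical stand-in for the command body.

    The only cancellation available here belongs to the caller (the opCtx in
    the ticket), which knows nothing about the recipient's failure.
    """
    return await transitioned


async def run(fail_early: bool, timeout: float = 0.2) -> str:
    loop = asyncio.get_running_loop()
    transitioned = loop.create_future()
    recipient = asyncio.create_task(recipient_state_machine(transitioned, fail_early))
    try:
        # The timeout exists only so this demo terminates; the wait in the
        # ticket has no such bound.
        return await asyncio.wait_for(recipient_clone_command(transitioned), timeout)
    except asyncio.TimeoutError:
        return "hung"
    finally:
        await recipient
```

Calling `asyncio.run(run(fail_early=False))` returns `"create-collection"`, while `asyncio.run(run(fail_early=True))` returns `"hung"` only because the demo imposes a timeout.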

This can occur if any non-retryable error occurs in the recipient within this region. Until SERVER-102452 is done, an obvious way for this to happen is for the wait for majority to fail here.

Resolving this is also not as simple as using the abortSource as the cancellation token. That source is only cancelled when the coordinator updates itself to aborting or calls ShardsvrAbortReshardCollectionCommand, which it won't do because it is blocked waiting on ShardsvrReshardRecipientClone.
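The circular wait can be sketched the same way. Again, this is an illustrative Python model under stated assumptions, not the real implementation: `clone_command` and `coordinator` are hypothetical names, and an `asyncio.Event` plays the role of the abortSource. The command would honour the abort signal, but the only code path that fires it sits behind the blocked call.

```python
import asyncio


async def clone_command(transitioned: asyncio.Future, abort: asyncio.Event) -> str:
    """Hypothetical command body that uses the abort event as its cancellation source."""
    abort_wait = asyncio.ensure_future(abort.wait())
    try:
        done, _ = await asyncio.wait(
            {transitioned, abort_wait}, return_when=asyncio.FIRST_COMPLETED
        )
        return "transitioned" if transitioned in done else "aborted"
    finally:
        abort_wait.cancel()


async def coordinator(timeout: float = 0.2) -> str:
    """Hypothetical coordinator that can only abort after the command returns."""
    loop = asyncio.get_running_loop()
    transitioned = loop.create_future()  # never fulfilled: the recipient failed early
    abort = asyncio.Event()
    try:
        # The timeout exists only so the demo terminates.
        result = await asyncio.wait_for(clone_command(transitioned, abort), timeout)
    except asyncio.TimeoutError:
        return "deadlock"
    # Only reachable once the command has returned, which is too late.
    abort.set()
    return result
```

Here `asyncio.run(coordinator())` returns `"deadlock"`: the command waits for the abort, and the abort waits for the command.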

is caused by
SERVER-97464 Implement resharding recipient cloning transition using commands. (Closed)

is depended on by
SERVER-105353 reshard_collection_atlas_log_ingestion.js Fails in AwaitingFetchTimestamp (Blocked)

is related to
SERVER-104265 Disable feature flag gFeatureFlagReshardingCloneNoRefresh (Closed)
SERVER-104317 Update WithAutomaticRetry to retry on WCEs (In Code Review)
SERVER-102452 Make ReshardingDonorService retry on WriteConcernFailure when finishing (Closed)

related to
SERVER-104494 Resharding can hang if recipient steps down during ShardsvrReshardRecipientClone (Backlog)
SERVER-102769 Adapt featureFlagReshardingCloneNoRefresh to use OFCV-aware checks (Backlog)

Assignee: Unassigned
Reporter: Brett Nawrocki
Participants: Brett Nawrocki, TPM Jira Automations Bot
Votes: 0
Watchers: 8

Due: 31/Jul/25
Created: Apr 23 2025 07:42:27 PM UTC
Updated: Jul 07 2025 04:34:43 PM UTC
