Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.3.0-rc0
Affects Version/s: None
Component/s: None
Labels: None
Assigned Teams: Cluster Scalability
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: ClusterScalability Nov10-Nov24, ClusterScalability Nov24-Dec8
Story Points: 5
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Currently, when the critical section times out, we rely on fulfilling the _allRecipientsReportedStrictConsistencyTimestamp promise in the ReshardingCoordinatorObserver to cancel resharding. However, that is not sufficient since there are other steps that come before the wait for that promise.

The first one is running _flushReshardingStateChange against all donors and recipients. This command triggers a sharding metadata refresh but does not wait for the refresh to complete, so it is generally quick to run. However, if the critical section times out, the coordinator should immediately transition to "aborting" and tell donors to transition out of "blocking-writes" instead of continuing to wait for the responses to this command from all donors and recipients.
The second one is running _shardsvrReshardingDonorFetchFinalCollectionStats against all donors. This was added in SPM-3918. This command waits for each donor to have acquired the critical section and for its ReshardingChangeStreamsMonitor to have processed all oplog entries up to the "blocking-writes" oplog entry. If for some reason the donor cannot acquire the critical section (for example, due to an in-progress transaction), the command is expected to hang. So it is critical to stop waiting for the responses to this command once the critical section has timed out.
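
The wait-or-abort behavior described for both steps can be sketched as a race between "all responses arrived" and "abort was signalled". The following is only an illustration in Python with asyncio, not the server's C++ implementation; `wait_for_all_or_abort`, `quick_donor`, `stuck_donor`, and `critical_section_timer` are hypothetical names, and the sleeps stand in for real command round trips and the real critical section timeout.

```python
import asyncio


async def wait_for_all_or_abort(responses, abort_event):
    """Wait for every response, but stop as soon as abort_event is set.

    Returns (True, results) if all responses arrived first,
    or (False, None) if the abort was signalled first.
    """
    gather_task = asyncio.ensure_future(asyncio.gather(*responses))
    abort_task = asyncio.ensure_future(abort_event.wait())
    done, _ = await asyncio.wait(
        {gather_task, abort_task}, return_when=asyncio.FIRST_COMPLETED
    )
    if gather_task in done:
        abort_task.cancel()
        return True, gather_task.result()
    # Abort won the race: cancel the outstanding requests instead of waiting.
    gather_task.cancel()
    try:
        await gather_task
    except asyncio.CancelledError:
        pass
    return False, None


async def demo():
    abort_event = asyncio.Event()

    async def quick_donor():
        # Hypothetical donor that responds promptly.
        await asyncio.sleep(0.01)
        return "ok"

    async def stuck_donor():
        # Hypothetical donor that cannot acquire the critical section
        # and therefore never responds.
        await asyncio.Event().wait()

    async def critical_section_timer():
        # Stand-in for the critical section timeout firing.
        await asyncio.sleep(0.05)
        abort_event.set()

    timer = asyncio.ensure_future(critical_section_timer())
    outcome = await wait_for_all_or_abort(
        [quick_donor(), stuck_donor()], abort_event
    )
    await timer
    return outcome
```

In this sketch the stuck donor would block forever on its own, but the coordinator-side wait returns as soon as the timer sets the abort event, which is the property the ticket asks for.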

In summary, a critical section timeout should cancel all the work that the coordinator does before aborting. This could be done by making the timeout cancel the abort source. Currently, this source is only cancelled when the user explicitly aborts the operation. So we need to introduce the notion of an implicit abort to ReshardingCoordinatorService.
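
One way to picture an abort source that distinguishes explicit from implicit aborts is sketched below. This is a hedged illustration in Python, not the ReshardingCoordinatorService code or MongoDB's cancellation API; `AbortKind`, `AbortSource`, and their methods are hypothetical names introduced only for this example.

```python
import enum
import threading


class AbortKind(enum.Enum):
    EXPLICIT = "explicit"  # requested by the user
    IMPLICIT = "implicit"  # triggered internally, e.g. critical section timeout


class AbortSource:
    """Hypothetical one-shot abort source that records who cancelled it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._kind = None
        self._reason = None
        self._callbacks = []

    @property
    def kind(self):
        with self._lock:
            return self._kind

    def cancel(self, kind, reason):
        """Cancel once; later calls are ignored and return False."""
        with self._lock:
            if self._kind is not None:
                return False
            self._kind, self._reason = kind, reason
            callbacks, self._callbacks = self._callbacks, []
        # Run callbacks outside the lock so they may call back into the source.
        for callback in callbacks:
            callback(kind, reason)
        return True

    def on_cancel(self, callback):
        """Register work to interrupt; runs immediately if already cancelled."""
        with self._lock:
            if self._kind is None:
                self._callbacks.append(callback)
                return
            kind, reason = self._kind, self._reason
        callback(kind, reason)
```

With a source like this, every pre-abort step registers itself through `on_cancel`, and both the user-facing abort path and the critical section timeout call `cancel`, passing `AbortKind.EXPLICIT` or `AbortKind.IMPLICIT` respectively. Whichever caller arrives first wins, so the recorded kind reflects the original cause of the abort.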

is depended on by

SERVER-109322 featureFlagReshardingSkipCloningAndApplyingIfApplicable makes resharding critical section get acquired on a non-donor db primary shard before critical section is engaged by coordinator

Closed

related to

SERVER-114077 Make sure that there can never be dangling _shardsvrRecipientCriticalSectionStarted threads when resharding gets aborted both implicitly and explicitly

Backlog

SERVER-115997 Resharding coordinator gets stuck in infinite retry loop when abort is requested during commit phase

Closed

SERVER-114180 Add helper for triggering tripwire without throwing

Backlog

SERVER-114451 Aborting a resharding operation in "aborted" or "quiesced" state as part of setFCV right after stepup should cancel the quiesce period

Closed

Assignee: Cheahuychou Mao
Reporter: Cheahuychou Mao
Participants: Cheahuychou Mao, Githook User
Votes: 0
Watchers: 4

Created: Nov 17 2025 05:01:21 PM UTC
Updated: Dec 26 2025 06:59:10 PM UTC
Resolved: Nov 26 2025 02:35:16 AM UTC
