- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
- Fully Compatible
- ALL
- ClusterScalability Nov10-Nov24, ClusterScalability Nov24-Dec8
- 5
Currently, when the critical section times out, we rely on fulfilling the _allRecipientsReportedStrictConsistencyTimestamp promise in the ReshardingCoordinatorObserver to cancel resharding. However, that is not sufficient since there are other steps that come before the wait for that promise.
- The first one is running _flushReshardingStateChange against all donors and recipients. This command triggers a sharding metadata refresh but does not wait for the refresh to complete, so it is generally quick to run. However, if the critical section times out, the coordinator should immediately transition to "aborting" and tell donors to transition out of "blocking-writes" instead of continuing to wait for the responses to this command from all donors and recipients.
- The second one is running _shardsvrReshardingDonorFetchFinalCollectionStats against all donors. This was added in SPM-3918. This command waits for each donor to have acquired the critical section and for its ReshardingChangeStreamsMonitor to have processed all oplog entries up to the "blocking-writes" oplog entry. If for some reason the donor cannot acquire the critical section (due to an in-progress transaction), the command is expected to hang. So it is critical to stop waiting for the responses to this command once the critical section has timed out.
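The desired behavior for both steps (stop waiting for participant responses as soon as the critical section times out) can be illustrated with a small sketch. This is Python asyncio, not the server's C++ code, and every name in it (`wait_for_all_or_abort`, `run_scenario`, the `donor` coroutine) is hypothetical and only stands in for the real commands and timers.

```python
import asyncio


async def wait_for_all_or_abort(requests, abort_event):
    """Wait for every participant response, but give up once abort_event is set.

    Returns the list of responses if all requests finish first, or None if the
    abort fires first (in which case the outstanding requests are cancelled).
    """
    gather_future = asyncio.ensure_future(asyncio.gather(*requests))
    abort_task = asyncio.ensure_future(abort_event.wait())
    done, _ = await asyncio.wait(
        {gather_future, abort_task}, return_when=asyncio.FIRST_COMPLETED
    )
    if gather_future in done:
        abort_task.cancel()
        return gather_future.result()
    # The abort fired first: cancelling the gather future cancels its children.
    gather_future.cancel()
    try:
        await gather_future
    except asyncio.CancelledError:
        pass
    return None


async def run_scenario(donor_delays, timeout_secs):
    """Hypothetical driver: one fake request per donor plus a timeout timer."""
    abort_event = asyncio.Event()

    async def donor(delay):
        # Stands in for a command sent to a donor; a long delay models a donor
        # that cannot acquire the critical section.
        await asyncio.sleep(delay)
        return "ok"

    async def critical_section_timeout():
        await asyncio.sleep(timeout_secs)
        abort_event.set()

    timer = asyncio.ensure_future(critical_section_timeout())
    try:
        return await wait_for_all_or_abort(
            [donor(d) for d in donor_delays], abort_event
        )
    finally:
        timer.cancel()
```

With a donor that never responds in time, for example `asyncio.run(run_scenario([0.01, 60.0], 0.05))`, the wait ends when the timeout fires instead of hanging on the stuck donor.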
In summary, a critical section timeout should cancel all the work that the coordinator does before aborting. This could be done by making the timeout cancel the abort source. Currently, this source is only cancelled when the user explicitly aborts the operation. So we need to introduce the notion of an implicit abort to ReshardingCoordinatorService.
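One way to picture the explicit versus implicit distinction is an abort source that records who cancelled it. The sketch below is a Python illustration under assumed names (`AbortReason`, `AbortSource`); it is not the server's cancellation API and makes no claim about how ReshardingCoordinatorService will implement this.

```python
import enum
import threading


class AbortReason(enum.Enum):
    # Hypothetical labels for the two ways an abort can be triggered.
    EXPLICIT_USER_ABORT = "explicit"
    IMPLICIT_CRITICAL_SECTION_TIMEOUT = "implicit"


class AbortSource:
    """A one-shot cancellation source that remembers why it was cancelled."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cancelled = threading.Event()
        self._reason = None

    def cancel(self, reason):
        """Cancel the source. The first caller wins; returns True for that caller."""
        with self._lock:
            if self._reason is not None:
                return False
            self._reason = reason
        self._cancelled.set()
        return True

    @property
    def reason(self):
        with self._lock:
            return self._reason

    def wait(self, timeout=None):
        """Block until cancelled; returns False if the wait itself times out."""
        return self._cancelled.wait(timeout)


def demo_implicit_abort():
    source = AbortSource()
    # The critical section timeout cancels the same source a user abort would.
    timer = threading.Timer(
        0.05, source.cancel, args=(AbortReason.IMPLICIT_CRITICAL_SECTION_TIMEOUT,)
    )
    timer.start()
    cancelled = source.wait(timeout=2.0)
    timer.join()
    # A later explicit abort is a no-op because the source is already cancelled.
    late_user_abort = source.cancel(AbortReason.EXPLICIT_USER_ABORT)
    return cancelled, source.reason, late_user_abort
```

Because both triggers go through the same `cancel` call, any work that already observes the source stops in either case, and the recorded reason lets later steps tell the two kinds of abort apart.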
- is depended on by
  - SERVER-109322 featureFlagReshardingSkipCloningAndApplyingIfApplicable makes resharding critical section get acquired on a non-donor db primary shard before critical section is engaged by coordinator (Closed)
- related to
  - SERVER-114077 Make sure that there can never be dangling _shardsvrRecipientCriticalSectionStarted threads when resharding gets aborted both implicitly and explicitly (Backlog)
  - SERVER-114180 Add Helper for Handling Tasserts in Prod (Needs Scheduling)
  - SERVER-114451 Aborting a resharding operation in "aborted" or "quiesced" state as part of setFCV right after stepup should cancel the quiesce period (Closed)