Type: Bug
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: None
Cluster Scalability
ALL
When featureFlagReshardingVerification is enabled, the coordinator performs an additional step between engaging the critical section and waiting for strict consistency or the critical section timeout: it runs the _shardsvrReshardingDonorFetchFinalCollectionStats command against all donors to get the delta in the number of documents since the cloneTimestamp. On each donor, this command blocks until the donor has written the "blocking-writes" oplog entry, which happens after the donor has acquired the critical section.
If a donor is blocked on acquiring the critical section (e.g. because there is an in-progress transaction), this command is expected to remain blocked. When the critical section times out, the ReshardingCoordinatorObserver fulfills _allRecipientsReportedStrictConsistencyTimestamp with a ReshardingCriticalSectionTimeout error, but the ReshardingCoordinatorService still gets stuck here because it does not also wait on that future. The fix would be to make `_fetchAndPersistNumDocumentsToCloneFromDonors` do a `whenAny` on both the awaitAllRecipientsInStrictConsistency() future and the getDocumentsDeltaFromDonors future, similar to this.
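
The proposed shape of the fix can be illustrated with a small, self-contained Python asyncio analogy. This is only a sketch of the "wait on whichever future finishes first" idea, not the actual C++ change: `fetch_delta_from_donors`, `fetch_num_documents`, `demo`, the 0.05 s timer, and the value 42 are all hypothetical stand-ins, and `asyncio.wait(..., return_when=FIRST_COMPLETED)` is used here in place of `whenAny`.

```python
import asyncio


class ReshardingCriticalSectionTimeout(Exception):
    """Stand-in for the ReshardingCriticalSectionTimeout error named in the ticket."""


async def fetch_delta_from_donors(donor_unblocked: asyncio.Event) -> int:
    # Hypothetical stand-in for the donor stats command: it blocks until the
    # donor has written its "blocking-writes" oplog entry.
    await donor_unblocked.wait()
    return 42  # made-up document delta


async def fetch_num_documents(delta_task: asyncio.Task,
                              strict_consistency: asyncio.Future) -> int:
    # Wait on BOTH futures and react to whichever completes first, so that a
    # timeout error on the strict-consistency future is not missed.
    done, _pending = await asyncio.wait(
        {delta_task, strict_consistency},
        return_when=asyncio.FIRST_COMPLETED,
    )
    if strict_consistency in done and strict_consistency.exception() is not None:
        delta_task.cancel()  # stop waiting on the blocked donors
        raise strict_consistency.exception()
    return await delta_task


async def demo(donor_blocked: bool) -> str:
    loop = asyncio.get_running_loop()
    donor_unblocked = asyncio.Event()
    if not donor_blocked:
        donor_unblocked.set()
    strict_consistency = loop.create_future()
    delta_task = asyncio.create_task(fetch_delta_from_donors(donor_unblocked))
    # Simulate the observer fulfilling the future with a timeout error.
    loop.call_later(
        0.05,
        strict_consistency.set_exception,
        ReshardingCriticalSectionTimeout("critical section timed out"),
    )
    try:
        delta = await fetch_num_documents(delta_task, strict_consistency)
    except ReshardingCriticalSectionTimeout:
        return "timed out"
    return f"delta={delta}"
```

Calling `asyncio.run(demo(donor_blocked=True))` exercises the scenario from this ticket: the donor fetch never completes, yet the caller still observes the timeout error instead of hanging. With `donor_blocked=False` the delta is returned before the timer fires.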