When featureFlagReshardingVerification is enabled, resharding can hang if the critical section times out while donors are still trying to acquire the critical section


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability
    • ALL

      When featureFlagReshardingVerification is enabled, the coordinator performs an additional step between engaging the critical section and waiting for strict consistency or the critical section timeout: it runs the _shardsvrReshardingDonorFetchFinalCollectionStats command against all donors to get the change in the number of documents since the cloneTimestamp. On each donor, this command blocks until the donor has written the "blocking-writes" oplog entry, which happens only after the donor has acquired the critical section.

      If a donor is blocked on acquiring the critical section (e.g. because there is an in-progress transaction), this command is expected to remain blocked. When the critical section times out, the ReshardingCoordinatorObserver fulfills _allRecipientsReportedStrictConsistencyTimestamp with a ReshardingCriticalSectionTimeout error, but the ReshardingCoordinatorService still gets stuck at this step because it does not also wait on that future. The fix would be to make `_fetchAndPersistNumDocumentsToCloneFromDonors` use a `whenAny` on both the awaitAllRecipientsInStrictConsistency() future and the getDocumentsDeltaFromDonors future, similar to this.
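
      To illustrate the shape of the proposed fix, here is a minimal sketch in Python asyncio. It is an analogy only, not the server's C++ code. `asyncio.wait(..., return_when=FIRST_COMPLETED)` stands in for `whenAny`. The two coroutines, the `donor_unblocked` event, and the returned `{"donor0": 5}` value are hypothetical stand-ins for the getDocumentsDeltaFromDonors and awaitAllRecipientsInStrictConsistency() futures.

```python
import asyncio


class ReshardingCriticalSectionTimeout(Exception):
    """Stand-in for the error the observer sets when the critical section times out."""


async def get_documents_delta_from_donors(donor_unblocked: asyncio.Event) -> dict:
    # Hypothetical stand-in for the donor stats command: it stays blocked
    # until the donor has acquired the critical section.
    await donor_unblocked.wait()
    return {"donor0": 5}


async def await_all_recipients_in_strict_consistency(timeout_s: float) -> None:
    # Hypothetical stand-in: in this sketch the future is only ever
    # fulfilled with the critical section timeout error.
    await asyncio.sleep(timeout_s)
    raise ReshardingCriticalSectionTimeout("critical section timed out")


async def fetch_num_documents_delta(donor_unblocked: asyncio.Event, timeout_s: float) -> dict:
    delta_task = asyncio.ensure_future(get_documents_delta_from_donors(donor_unblocked))
    strict_task = asyncio.ensure_future(await_all_recipients_in_strict_consistency(timeout_s))

    # Race the two futures instead of waiting on the donor command alone.
    done, _ = await asyncio.wait(
        {delta_task, strict_task}, return_when=asyncio.FIRST_COMPLETED
    )

    if delta_task not in done:
        exc = strict_task.exception()
        if exc is not None:
            # Timeout won the race: stop waiting on the donors and surface the error.
            delta_task.cancel()
            await asyncio.gather(delta_task, return_exceptions=True)
            raise exc
        # The other future completed without error: keep waiting for the donors.
        return await delta_task

    # Donors answered first: the other wait is no longer needed here.
    strict_task.cancel()
    await asyncio.gather(strict_task, return_exceptions=True)
    return delta_task.result()
```

      With a donor that never unblocks, `fetch_num_documents_delta` raises the timeout error instead of hanging. With a donor that responds, it returns the delta as before.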

            Assignee:
            Cheahuychou Mao
            Reporter:
            Cheahuychou Mao