SERVER-61633: Resharding's RecipientStateMachine doesn't join thread pool for ReshardingOplogFetcher, leading to server crash at shutdown

    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Backport Versions: v5.1, v5.0
    • Sprint: Sharding 2021-11-29

      resharding::cancelWhenAnyErrorThenQuiesce() uses whenAllSucceed() from future_util.h in combination with whenAll() to wait for all of the data replication components to exit. The whenAllSucceed().onError() pattern is unreliable for this purpose because the onError() lambda won't run when the executor (the scoped executor in this case) has already been shut down. Instead, the kExecutorShutdownStatus error is propagated back to RecipientStateMachine through _dataReplicationQuiesced and consumed by the onCompletion() continuation running on the cleanup executor.
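
      The hazard can be illustrated outside of the server. The following is a minimal, self-contained C++ sketch using a toy executor rather than MongoDB's ExecutorFuture API (ToyExecutor and runOnError() are invented for illustration): once the executor is shut down it rejects new work, so the error-recovery callback never runs and the caller observes only the shutdown status.

      #include <functional>
      #include <iostream>
      #include <string>
      
      struct ToyExecutor {
          bool shutDown = false;
      
          // Runs the task inline, or rejects it if the executor has been shut
          // down, mirroring how a scoped executor refuses work after shutdown.
          bool schedule(const std::function<void()>& task) {
              if (shutDown)
                  return false;
              task();
              return true;
          }
      };
      
      // Toy analogue of .onError(): the handler is supposed to run on `exec`
      // whenever `status` is an error, but a shut-down executor silently
      // replaces the handler's outcome with a shutdown status.
      std::string runOnError(ToyExecutor& exec,
                             const std::string& status,
                             const std::function<void()>& handler) {
          if (status == "OK")
              return status;
          if (!exec.schedule(handler))
              return "kExecutorShutdownStatus";
          return status;
      }
      
      int main() {
          ToyExecutor exec;
          exec.shutDown = true;  // Shutdown wins the race against the error path.
      
          std::string observed = runOnError(exec, "SomeError", [] {
              // This is where cancelSource.cancel() and the quiesce would run.
              std::cout << "quiescing data replication components\n";  // Never prints.
          });
          std::cout << "observed status: " << observed << '\n';  // kExecutorShutdownStatus
      }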

      Since the ReshardingOplogFetcher runs on ReshardingDataReplication::_oplogFetcherExecutor and the whenAll() continuation was skipped, one of its tasks may still be running at shutdown after the RecipientStateMachine, and therefore the ReshardingDataReplication, have been destroyed. A solution here would be to have RecipientStateMachine::_runMandatoryCleanup() join the ReshardingDataReplication::_oplogFetcherExecutor; a sketch of that approach follows the code excerpt below.

      ExecutorFuture<void> cancelWhenAnyErrorThenQuiesce(
          const std::vector<SharedSemiFuture<void>>& futures,
          ExecutorPtr executor,
          CancellationSource cancelSource) {
          return whenAllSucceedOn(futures, executor)
              // BUG: if `executor` (the scoped executor) has already been shut
              // down, this onError() continuation never runs, so the whenAll()
              // below is skipped and the components are never waited on.
              .onError([futures, executor, cancelSource](Status originalError) mutable {
                  cancelSource.cancel();
      
                  // Wait for every component to exit, then surface the original
                  // error rather than any result of the quiesce itself.
                  return whenAll(thenRunAllOn(futures, executor))
                      .ignoreValue()
                      .thenRunOn(executor)
                      .onCompletion([originalError](auto) { return originalError; });
              });
      }
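
      Below is a minimal sketch of the proposed cleanup, assuming a toy thread pool in place of MongoDB's ThreadPool (the ToyThreadPool class, the shutdownAndJoin() name, and the main() flow are illustrative, not the actual patch): joining the executor during mandatory cleanup guarantees every fetcher task has finished before the owning objects are destroyed.

      #include <condition_variable>
      #include <functional>
      #include <iostream>
      #include <mutex>
      #include <queue>
      #include <thread>
      
      // Toy stand-in for ReshardingDataReplication::_oplogFetcherExecutor.
      class ToyThreadPool {
      public:
          ToyThreadPool() : _worker([this] { _run(); }) {}
      
          void schedule(std::function<void()> task) {
              std::lock_guard<std::mutex> lk(_mutex);
              _tasks.push(std::move(task));
              _cv.notify_one();
          }
      
          // After this returns, no task is running or will ever run again.
          void shutdownAndJoin() {
              {
                  std::lock_guard<std::mutex> lk(_mutex);
                  _shutdown = true;
              }
              _cv.notify_one();
              _worker.join();
          }
      
      private:
          void _run() {
              for (;;) {
                  std::function<void()> task;
                  {
                      std::unique_lock<std::mutex> lk(_mutex);
                      _cv.wait(lk, [&] { return _shutdown || !_tasks.empty(); });
                      if (_tasks.empty())
                          return;  // Shut down and fully drained.
                      task = std::move(_tasks.front());
                      _tasks.pop();
                  }
                  task();
              }
          }
      
          std::mutex _mutex;
          std::condition_variable _cv;
          std::queue<std::function<void()>> _tasks;
          bool _shutdown = false;
          std::thread _worker;
      };
      
      int main() {
          ToyThreadPool oplogFetcherExecutor;
          oplogFetcherExecutor.schedule([] { std::cout << "fetching oplog batch\n"; });
      
          // _runMandatoryCleanup() analogue: join the executor *before* the
          // state machine (and the data replication state it owns) goes away.
          oplogFetcherExecutor.shutdownAndJoin();
          std::cout << "safe to destroy RecipientStateMachine state\n";
      }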
      

            Assignee: Max Hirschhorn (max.hirschhorn@mongodb.com)
            Reporter: Max Hirschhorn (max.hirschhorn@mongodb.com)
            Votes: 0
            Watchers: 2
