[SERVER-61633] Resharding's RecipientStateMachine doesn't join thread pool for ReshardingOplogFetcher, leading to server crash at shutdown
Created: 19/Nov/21  Updated: 29/Oct/23  Resolved: 20/Nov/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 5.0.0, 5.1.0
Fix Version/s: 5.2.0, 5.0.5, 5.1.1

Type: Bug
Priority: Major - P3
Reporter: Max Hirschhorn
Assignee: Max Hirschhorn
Resolution: Fixed
Votes: 0
Labels: sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-61950 ReshardingOplogFetcher waits on netwo... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.1, v5.0
Sprint: Sharding 2021-11-29
Participants:
Linked BF Score: 145
Story Points: 1

 Description   

resharding::cancelWhenAnyErrorThenQuiesce() uses whenAllSucceed() from future_util.h in combination with whenAll() to wait for all of the data replication components to exit. The whenAllSucceed().onError() pattern is unreliable for this because the onError() lambda won't run when the executor (the scoped executor here) has already been shut down. The kExecutorShutdownStatus error is instead propagated back to RecipientStateMachine through _dataReplicationQuiesced and consumed by the onCompletion() continuation, which runs on the cleanup executor.

Since the ReshardingOplogFetcher runs on ReshardingDataReplication::_oplogFetcherExecutor and the whenAll() was skipped, one of its tasks may still be running at shutdown after the RecipientStateMachine, and thus the ReshardingDataReplication, has been destroyed. A solution here would be to have RecipientStateMachine::_runMandatoryCleanup() join ReshardingDataReplication::_oplogFetcherExecutor.
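
The proposed fix can be sketched in isolation as follows. This is a minimal standalone illustration, not the server code: TinyPool, ReplicationState, and cleanupJoinsFetcher() are hypothetical names standing in for _oplogFetcherExecutor, the state owned by ReshardingDataReplication, and _runMandatoryCleanup() respectively. It only shows the ordering the fix relies on: signal the task to stop, join the pool, and only then destroy the state the task uses.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <memory>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a thread pool such as _oplogFetcherExecutor.
class TinyPool {
public:
    // Runs the task on its own thread.
    void schedule(std::function<void()> task) {
        _threads.emplace_back(std::move(task));
    }

    // Blocks until every scheduled task has returned.
    void join() {
        for (auto& t : _threads) {
            if (t.joinable()) t.join();
        }
        _threads.clear();
    }

    ~TinyPool() { join(); }

private:
    std::vector<std::thread> _threads;
};

// Hypothetical stand-in for state that a fetcher task dereferences.
struct ReplicationState {
    std::atomic<bool> stop{false};
    std::atomic<int> activeTasks{0};

    // Destroying the state while a task still uses it is the crash condition.
    ~ReplicationState() { assert(activeTasks.load() == 0); }
};

// Returns true if the fetcher task had exited before the state was destroyed.
bool cleanupJoinsFetcher() {
    TinyPool fetcherPool;
    auto state = std::make_unique<ReplicationState>();
    std::atomic<bool> fetcherExited{false};

    ReplicationState* raw = state.get();
    fetcherPool.schedule([raw, &fetcherExited] {
        raw->activeTasks.fetch_add(1);
        while (!raw->stop.load()) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        raw->activeTasks.fetch_sub(1);
        fetcherExited.store(true);
    });

    // Cleanup step: request a stop, then join the pool before tearing down.
    state->stop.store(true);
    fetcherPool.join();
    bool exitedBeforeDestroy = fetcherExited.load();

    state.reset();  // Safe only because the pool was joined first.
    return exitedBeforeDestroy;
}
```

Without the join() call in the cleanup step, state.reset() could run while the task is still inside its loop, which is the use-after-destroy this ticket describes.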

ExecutorFuture<void> cancelWhenAnyErrorThenQuiesce(
    const std::vector<SharedSemiFuture<void>>& futures,
    ExecutorPtr executor,
    CancellationSource cancelSource) {
    return whenAllSucceedOn(futures, executor)
        .onError([futures, executor, cancelSource](Status originalError) mutable {
            cancelSource.cancel();
 
            return whenAll(thenRunAllOn(futures, executor))
                .ignoreValue()
                .thenRunOn(executor)
                .onCompletion([originalError](auto) { return originalError; });
        });
}
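
The failure mode of the function above can be illustrated with a toy model. This sketch does not use the server's future library; ToyExecutor, QuiesceResult, and failThenQuiesce() are hypothetical names, and the string statuses are placeholders. It assumes only what the description states: a continuation scheduled on an executor that has already been shut down never runs, and the caller sees a shutdown error instead of the original one.

```cpp
#include <functional>
#include <string>

// Hypothetical toy executor: runs work inline until shut down, then rejects it.
class ToyExecutor {
public:
    void shutdown() { _shutDown = true; }

    // Returns false, without running the task, once the executor is shut down.
    bool schedule(const std::function<void()>& task) {
        if (_shutDown) return false;
        task();
        return true;
    }

private:
    bool _shutDown = false;
};

struct QuiesceResult {
    bool errorHandlerRan = false;  // Did the onError()-style handler run?
    std::string status;            // Status the caller ends up observing.
};

// Models the shape of whenAllSucceed().onError(handler) after a component
// has failed: the handler is supposed to cancel and wait for everything.
QuiesceResult failThenQuiesce(ToyExecutor& executor) {
    QuiesceResult result;
    result.status = "OriginalError";

    bool scheduled = executor.schedule([&result] {
        // In the real function this is where cancelSource.cancel() and the
        // whenAll() wait would happen.
        result.errorHandlerRan = true;
    });

    if (!scheduled) {
        // The handler was skipped, so nothing waited for the components.
        result.status = "ExecutorShutdown";
    }
    return result;
}
```

With a live executor the handler runs and the original error is returned. After shutdown() the handler is skipped and only the shutdown status is reported, so a caller cannot treat that result as proof that every component has quiesced.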



 Comments   
Comment by Githook User [ 20/Nov/21 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-61633 Join _oplogFetcherExecutor in resharding recipient at exit.

Also corrects the 5.0 backport of
1bd1c4f6a0d571443a80c52d1b3f284a0c078af4 from SERVER-59812 and leaves
the ReshardingMetrics intact until the resharding data replication
components have quiesced.

(cherry picked from commit 34cac37ac5a61946aae9d149c8cb2f1d109e7320)
Branch: v5.0
https://github.com/mongodb/mongo/commit/3d22412e0eed75c96771a849d4e98e3309f458f0

Comment by Githook User [ 20/Nov/21 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-61633 Join _oplogFetcherExecutor in resharding recipient at exit.

(cherry picked from commit 34cac37ac5a61946aae9d149c8cb2f1d109e7320)
Branch: v5.1
https://github.com/mongodb/mongo/commit/d0a2c4c526a3a4d9bc0501a3f07300375c5b1c4d

Comment by Githook User [ 19/Nov/21 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-61633 Join _oplogFetcherExecutor in resharding recipient at exit.
Branch: master
https://github.com/mongodb/mongo/commit/34cac37ac5a61946aae9d149c8cb2f1d109e7320
