[SERVER-61633] Resharding's RecipientStateMachine doesn't join thread pool for ReshardingOplogFetcher, leading to server crash at shutdown
Created: 19/Nov/21 Updated: 29/Oct/23 Resolved: 20/Nov/21
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 5.0.0, 5.1.0 |
| Fix Version/s: | 5.2.0, 5.0.5, 5.1.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Max Hirschhorn | Assignee: | Max Hirschhorn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-nyc-subteam1 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v5.1, v5.0 |
| Sprint: | Sharding 2021-11-29 |
| Linked BF Score: | 145 |
| Story Points: | 1 |
| Description |
resharding::cancelWhenAnyErrorThenQuiesce() uses whenAllSucceed() from future_util.h in combination with whenAll() to wait for all of the data replication components to exit. The whenAllSucceed().onError() pattern is unreliable for this purpose because the onError() lambda won't run when the executor (here, the scoped executor) has already been shut down. Instead, the kExecutorShutdownStatus error is propagated back to RecipientStateMachine through _dataReplicationQuiesced and consumed by the onCompletion() callback, which runs on the cleanup executor.

The ReshardingOplogFetcher runs on ReshardingDataReplication::_oplogFetcherExecutor. Because the onError() lambda never ran, the whenAll() was skipped, so a task from that executor may still be running at shutdown after the RecipientStateMachine, and therefore the ReshardingDataReplication, has been destroyed.

A solution here would be to have RecipientStateMachine::_runMandatoryCleanup() join ReshardingDataReplication::_oplogFetcherExecutor.
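The proposed fix can be illustrated with a small, self-contained C++ sketch, assuming a plain std::thread pool in place of MongoDB's executors. SimpleThreadPool, DataReplicationSketch, runMandatoryCleanup(), and runCleanupDemo() are hypothetical names; they are not MongoDB's classes or the actual patch. The point the sketch makes is that the cleanup path shuts down and joins the fetcher pool unconditionally, so it does not depend on an onError() continuation that may never run on an already-shut-down executor.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool standing in for _oplogFetcherExecutor (NOT MongoDB's
// ThreadPool); it only models the shutdown-then-join contract.
class SimpleThreadPool {
public:
    explicit SimpleThreadPool(std::size_t numThreads) {
        for (std::size_t i = 0; i < numThreads; ++i)
            _workers.emplace_back([this] { _workerLoop(); });
    }
    ~SimpleThreadPool() { shutdown(); join(); }

    // Returns false and drops the task once shut down, modeling a
    // continuation that never runs on an already-shut-down executor.
    bool schedule(std::function<void()> task) {
        std::lock_guard<std::mutex> lk(_mutex);
        if (_shutdown) return false;
        _tasks.push_back(std::move(task));
        _cv.notify_one();
        return true;
    }

    // Stops accepting new tasks; already-queued tasks still run.
    void shutdown() {
        std::lock_guard<std::mutex> lk(_mutex);
        _shutdown = true;
        _cv.notify_all();
    }

    // Blocks until every worker thread has exited. Call from one thread only.
    void join() {
        for (auto& worker : _workers)
            if (worker.joinable()) worker.join();
    }

private:
    void _workerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(_mutex);
                _cv.wait(lk, [this] { return _shutdown || !_tasks.empty(); });
                if (_tasks.empty()) return;  // Shut down and fully drained.
                task = std::move(_tasks.front());
                _tasks.pop_front();
            }
            task();
        }
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    std::deque<std::function<void()>> _tasks;
    bool _shutdown = false;
    std::vector<std::thread> _workers;
};

// Hypothetical stand-in for ReshardingDataReplication; counter outlives the pool.
struct DataReplicationSketch {
    std::atomic<int> batchesFetched{0};
    SimpleThreadPool oplogFetcherExecutor{2};

    void startFetching(int numBatches) {
        for (int i = 0; i < numBatches; ++i)
            oplogFetcherExecutor.schedule([this] {  // Task dereferences `this`.
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
                batchesFetched.fetch_add(1);
            });
    }
};

// Hypothetical analogue of RecipientStateMachine::_runMandatoryCleanup(): join
// the fetcher pool unconditionally instead of relying on a skipped continuation.
void runMandatoryCleanup(DataReplicationSketch& dataReplication) {
    dataReplication.oplogFetcherExecutor.shutdown();
    dataReplication.oplogFetcherExecutor.join();
}

int runCleanupDemo(int numBatches) {
    auto dataReplication = std::make_unique<DataReplicationSketch>();
    dataReplication->startFetching(numBatches);
    runMandatoryCleanup(*dataReplication);
    const int fetched = dataReplication->batchesFetched.load();
    assert(fetched == numBatches);  // Queued tasks drained before join() returned.
    dataReplication.reset();        // Safe: no worker thread can still touch it.
    return fetched;
}
```

For example, runCleanupDemo(10) returns 10: shutdown() lets the ten queued tasks drain, and join() returns only after every worker thread has exited, so destroying the owning object afterwards cannot race with a running task. Destroying the owner without that join while tasks are still in flight is the situation the description above warns about.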
| Comments |
| Comment by Githook User [ 20/Nov/21 ] |
Author: Max Hirschhorn (visemet) <max.hirschhorn@mongodb.com>
Message: Also corrects the 5.0 backport of (cherry picked from commit 34cac37ac5a61946aae9d149c8cb2f1d109e7320)

| Comment by Githook User [ 20/Nov/21 ] |
Author: Max Hirschhorn (visemet) <max.hirschhorn@mongodb.com>
Message: (cherry picked from commit 34cac37ac5a61946aae9d149c8cb2f1d109e7320)

| Comment by Githook User [ 19/Nov/21 ] |
Author: Max Hirschhorn (visemet) <max.hirschhorn@mongodb.com>
Message: