Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.3.0, 5.1.2, 5.0.6, 5.2.0-rc1
Affects Version/s: 5.0.0, 5.1.0
Component/s: Sharding
Labels:
- sharding-nyc-subteam1

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:
v5.2, v5.1, v5.0
Sprint:
Sharding 2021-12-13
Linked BF Score:
50
Story Points:
3
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

The ReshardingOplogFetcher uses ShardRemote::runAggregation() to run resharding's oplog fetching pipeline. ShardRemote::runAggregation() uses the Fetcher class to schedule the remote network requests. Fetcher::join() doesn't wait using an OperationContext, so it continues to block even after the node steps down. For as long as the network request continues to run on the remote node, the still-active Instance will prevent PrimaryOnlyService::onStepUp() and the overall step-up procedure from completing.

We should instead have Fetcher::join() wait using an OperationContext so the Fetcher::~Fetcher() destructor can abandon waiting for the remote network request.
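
The difference between the two waiting styles can be sketched in standalone C++. This is only an illustration under stated assumptions, not the server's code: SketchFetcher, markDone(), joinUninterruptible(), and joinInterruptible() are hypothetical names, and a plain std::atomic<bool> stands in for the interruption state that a real OperationContext would carry.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the Fetcher described above; not the server's class.
class SketchFetcher {
public:
    // Called when the (simulated) remote request finishes.
    void markDone() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _done = true;
        }
        _cv.notify_all();
    }

    // Behavior described in the report: blocks until the remote request
    // finishes, regardless of what happens to the node in the meantime.
    void joinUninterruptible() {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [this] { return _done; });
    }

    // Proposed behavior: also stop waiting once the caller's context is killed.
    // `killed` is an assumed substitute for an OperationContext's interruption
    // state. Returns true if the request completed, false if the wait was
    // abandoned.
    bool joinInterruptible(const std::atomic<bool>& killed) {
        std::unique_lock<std::mutex> lk(_mutex);
        while (!_done) {
            if (killed.load()) {
                return false;  // e.g. the node stepped down; give up waiting
            }
            // Wake up periodically to re-check the interruption flag.
            _cv.wait_for(lk, std::chrono::milliseconds(10));
        }
        return true;
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _done = false;
};
```

In this sketch, a destructor that calls joinInterruptible() returns shortly after the flag is set even if markDone() is never called, whereas one that calls joinUninterruptible() stays blocked until the remote request completes. Polling on a short timeout is a simplification chosen to keep the example small; it is not a claim about how the actual fix waits.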

is related to

SERVER-60859 ReshardingCoordinator waits on _canEnterCritical future without cancellation, potentially preventing config server primary step-up from ever completing

Closed

SERVER-61633 Resharding's RecipientStateMachine doesn't join thread pool for ReshardingOplogFetcher, leading to server crash at shutdown

Closed

Assignee:: Max Hirschhorn
Reporter:: Max Hirschhorn
Participants:: Githook User, Max Hirschhorn
Votes:: 0
Watchers:: 2

Created:: Dec 07 2021 09:01:34 PM UTC
Updated:: Oct 29 2023 09:45:14 PM UTC
Resolved:: Dec 10 2021 03:12:12 PM UTC
Confidence Status Last Update:: 07/Dec/21 9:02 PM
