[SERVER-53931] Investigate how to cancel recipients cloning/applying in resharding Created: 20/Jan/21  Updated: 29/Oct/23  Resolved: 15/Mar/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Task Priority: Major - P3
Reporter: Haley Connelly Assignee: Haley Connelly
Resolution: Fixed Votes: 0
Labels: PM-234-M3, PM-234-T-error-flow
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-54943 Modify Resharding{Recipient|Donor}Ser... Backlog
is depended on by SERVER-53592 Investigate SERVER-52750 Closed
Backwards Compatibility: Fully Compatible
Sprint: Sharding 2021-02-22, Sharding 2021-03-08, Sharding 2021-03-22
Participants:
Story Points: 1

 Description   

Goal: determine how to effectively interrupt recipients cloning/applying in resharding.

It may be useful to look more into cancellation tokens and whether those could be used for such a task.



 Comments   
Comment by Githook User [ 15/Mar/21 ]

Author:

{'name': 'Haley Connelly', 'email': 'haley.connelly@mongodb.com', 'username': 'haleyConnelly'}

Message: SERVER-53931 Use cancelationTokens for resharding recipient replication components
Branch: master
https://github.com/mongodb/mongo/commit/67ff8452c4172dbbfd2199df2dc349eb739b7bf1

Comment by Haley Connelly [ 02/Mar/21 ]

This ticket will now focus on making sure cancellation tokens are passed to, and periodically checked by, the cloner, oplog applier, txn cloner, and oplog fetcher.
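
As a rough illustration of "passed and periodically checked", here is a minimal, self-contained sketch of a batch loop that polls a token before each unit of work. CancelSource, CancelToken, CloneResult, and runBatches are hypothetical stand-ins written for this example; they are not the server's cancellation classes, and the real components are asynchronous rather than a plain loop.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical stand-in for a cancellation token: a read-only view of a shared flag.
class CancelToken {
public:
    explicit CancelToken(std::shared_ptr<std::atomic<bool>> flag) : _flag(std::move(flag)) {}
    bool isCanceled() const { return _flag->load(); }

private:
    std::shared_ptr<std::atomic<bool>> _flag;
};

// Hypothetical stand-in for a cancellation source: owns the flag and can set it.
class CancelSource {
public:
    CancelSource() : _flag(std::make_shared<std::atomic<bool>>(false)) {}
    void cancel() { _flag->store(true); }
    CancelToken token() const { return CancelToken(_flag); }

private:
    std::shared_ptr<std::atomic<bool>> _flag;
};

struct CloneResult {
    std::size_t batchesProcessed;  // number of batches that ran to completion
    bool interrupted;              // true if the loop stopped because of the token
};

// Runs up to totalBatches units of work, checking the token before each one.
// A batch that has already started is allowed to finish.
template <typename BatchFn>
CloneResult runBatches(std::size_t totalBatches, const CancelToken& token, BatchFn processBatch) {
    for (std::size_t i = 0; i < totalBatches; ++i) {
        if (token.isCanceled()) {
            return {i, true};
        }
        processBatch(i);
    }
    return {totalBatches, false};
}
```

In this sketch the granularity of the check (once per batch) bounds how long a component keeps working after cancellation is requested.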

Comment by Haley Connelly [ 09/Feb/21 ]

For now, there isn't a POC for this because priorities have shifted elsewhere.

Comment by Haley Connelly [ 09/Feb/21 ]

Summary of plan for cancellation tokens:

The idea is that each resharding state machine will have two cancellation tokens: a posToken, the token passed in from the PrimaryOnlyService instance's run() method (e.g. ReshardingCoordinatorService::ReshardingCoordinator::run()), and an abortToken, a token derived from a cancellation source that takes in the posToken.

When there is a stepdown, the posToken will be canceled. When there is an unrecoverable error, the resharding instance will cancel the abortToken. 

Similar to checkIfReceivedDonorAbortMigration() in tenant_migration, resharding should have a method that uses the tokens to distinguish an unrecoverable error from a recoverable one.

Recoverable error (failover/stepdown)
If both the posToken and the abortToken are canceled, then a recoverable error has occurred. This is because the abortToken is derived from the posToken: if the posToken is canceled, the abortToken is automatically canceled as well.

Unrecoverable error (abort resharding operation entirely)
The abortToken is canceled, but the posToken is not.
Ex: inside ReshardingDonorService::onReshardingFieldsChanges(), the updated coordinator document contains an abortReason, so the method cancels the abortToken with that abortReason. Inside the main run() future chain for the donor state machine, .onError() knows to start cleaning up the DonorDocument and the various collections created for the operation (aborting the operation entirely) because the abortToken is canceled while the posToken is still live.
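
The two-token plan above can be sketched as a small parent/child hierarchy plus a classification helper. This is only an illustration under stated assumptions: CancelState, CancelSource, CancelToken, Interruption, and classify are hypothetical names invented here, not the server's classes, and the sketch leaves out the abortReason that the real abort path would carry.

```cpp
#include <atomic>
#include <memory>
#include <utility>

// Hypothetical shared state: one flag per source, linked to an optional parent.
struct CancelState {
    std::atomic<bool> canceled{false};
    std::shared_ptr<CancelState> parent;  // null for a root source
};

class CancelToken {
public:
    explicit CancelToken(std::shared_ptr<CancelState> state) : _state(std::move(state)) {}

    // A token counts as canceled if its own source or any ancestor was canceled.
    bool isCanceled() const {
        for (const CancelState* s = _state.get(); s != nullptr; s = s->parent.get()) {
            if (s->canceled.load()) {
                return true;
            }
        }
        return false;
    }

private:
    friend class CancelSource;
    std::shared_ptr<CancelState> _state;
};

class CancelSource {
public:
    CancelSource() : _state(std::make_shared<CancelState>()) {}

    // Child source: canceling the parent's source also cancels this source's tokens.
    explicit CancelSource(const CancelToken& parent) : CancelSource() {
        _state->parent = parent._state;
    }

    void cancel() { _state->canceled.store(true); }
    CancelToken token() const { return CancelToken(_state); }

private:
    std::shared_ptr<CancelState> _state;
};

enum class Interruption { kNone, kRecoverable, kUnrecoverable };

// posToken canceled   -> stepdown/failover (abortToken is canceled too): recoverable.
// only abortToken set -> the operation itself was aborted: unrecoverable.
Interruption classify(const CancelToken& posToken, const CancelToken& abortToken) {
    if (posToken.isCanceled()) {
        return Interruption::kRecoverable;
    }
    if (abortToken.isCanceled()) {
        return Interruption::kUnrecoverable;
    }
    return Interruption::kNone;
}
```

Because cancellation only flows from parent to child, checking the posToken first is enough to tell the two cases apart.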

 

Comment by Max Hirschhorn [ 21/Jan/21 ]

Marking this as 1 point so that no more than 1 week is spent on it before reporting findings to the group.

Generated at Thu Feb 08 05:32:13 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.