[SERVER-73897] Resharding coordinator returns generic abort error after recovery from stepdown
Created: 10/Feb/23  Updated: 12/Dec/23

Status: Backlog
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Randolph Tan
Assignee: Backlog - Cluster Scalability
Resolution: Unresolved
Votes: 0
Labels: cs-subteam1, sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: new_test.diff
Issue Links:
Depends
Duplicate
is duplicated by SERVER-80930 reshardCollection command can return ... Closed
Assigned Teams:
Cluster Scalability
Operating System: ALL
Participants:
Linked BF Score: 5

 Description   

When resharding aborts, the coordinator stores the abort reason in the coordinator document. If the coordinator steps down and is restarted, it aborts its cancellation token when it sees that the persisted state is aborting. This in turn causes it to get a callback cancelled error later (I suspect from here), and the resharding coordinator then treats the failure as if the user had aborted resharding, returning the generic ReshardingAborted error code instead of the original error code.
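
To make the sequence concrete, here is a minimal standalone model of the behavior described above. It is only a sketch and not the server's code: every name in it (CoordinatorDoc, runAfterRecovery, finalStatusReported, finalStatusPreferPersisted, OriginalFailure) is hypothetical, and the "prefer the persisted reason" variant is an assumed remedy for illustration, not a confirmed fix.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the server's error codes and status type.
enum class ErrorCode { OK, ReshardingAborted, CallbackCanceled, OriginalFailure };

struct Status {
    ErrorCode code;
    std::string reason;
};

enum class CoordinatorState { Running, Aborting };

// Hypothetical stand-in for the persisted coordinator document.
struct CoordinatorDoc {
    CoordinatorState state;
    bool hasAbortReason;
    Status abortReason;  // written when resharding aborts
};

// Builds a document as it would look after resharding aborted with `original`.
CoordinatorDoc makeAbortedDoc(ErrorCode original) {
    return CoordinatorDoc{CoordinatorState::Aborting, true,
                          Status{original, "original abort reason"}};
}

// Models recovery after a stepdown: the coordinator sees state == Aborting,
// aborts its cancellation token, and the next step fails with a
// callback-cancelled error instead of the original one.
Status runAfterRecovery(const CoordinatorDoc& doc) {
    assert(doc.state == CoordinatorState::Aborting);
    return Status{ErrorCode::CallbackCanceled, "token aborted on recovery"};
}

// Reported behavior: the cancellation is treated like a user-initiated abort.
Status finalStatusReported(const CoordinatorDoc& doc) {
    Status s = runAfterRecovery(doc);
    if (s.code == ErrorCode::CallbackCanceled) {
        return Status{ErrorCode::ReshardingAborted, "aborted"};
    }
    return s;
}

// Assumed remedy (illustrative only): prefer the persisted abort reason.
Status finalStatusPreferPersisted(const CoordinatorDoc& doc) {
    Status s = runAfterRecovery(doc);
    if (s.code == ErrorCode::CallbackCanceled) {
        if (doc.hasAbortReason) {
            return doc.abortReason;
        }
        return Status{ErrorCode::ReshardingAborted, "aborted"};
    }
    return s;
}
```

For a document built with makeAbortedDoc(ErrorCode::OriginalFailure), finalStatusReported yields ReshardingAborted, while finalStatusPreferPersisted yields OriginalFailure, which is the result the ticket expects after a stepdown.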



 Comments   
Comment by Randolph Tan [ 10/Feb/23 ]

Attached a diff with two C++ test cases. The first checks that the resharding coordinator returns the original error code in the normal case without a stepdown (this passes). The second checks that it also returns the original error code after a stepdown (this fails).
