[SERVER-31922] Make the migration chunk cloner source resilient to stepdowns and network errors
Created: 10/Nov/17  Updated: 06/Dec/22  Resolved: 29/Jul/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Dianna Hohensee (Inactive)
Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Won't Fix
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-36126 moveChunk fails on the source shard w... Closed
Related
Assigned Teams:
Sharding
Participants:
Linked BF Score: 20

 Description   

On the donor shard, MigrationChunkClonerSourceLegacy::startClone calls the recipient through the private member function _callRecipient, which in turn makes the call via the task executor. The task executor does not retry NotMaster errors, so a stepdown on the recipient surfaces as a failure instead of being retried.

One solution would be to use a ShardRemote instead and allow NotMaster errors to be retried for the first command, _recvChunkStart. We don't want to do this for all commands, but the first one is safe, I think.
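
The proposal above amounts to a per-command retry policy: retry NotMaster only for _recvChunkStart and fail fast for everything else. The following is a minimal Python sketch of that policy, not the server's C++ code. NotMasterError, RETRYABLE_COMMANDS, call_recipient, make_flaky_sender, and the max_attempts default are all hypothetical names and values chosen for illustration; none of them are MongoDB APIs.

```python
from typing import Callable, List, Tuple


class NotMasterError(Exception):
    """Hypothetical stand-in for a NotMaster response from the recipient."""


# Assumption taken from the ticket: only the first command is safe to retry.
RETRYABLE_COMMANDS = frozenset({"_recvChunkStart"})


def call_recipient(
    send: Callable[[str], dict], command: str, max_attempts: int = 3
) -> dict:
    """Send `command`, retrying NotMaster only for allow-listed commands."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Commands outside the allow-list get exactly one attempt.
    attempts = max_attempts if command in RETRYABLE_COMMANDS else 1
    for attempt in range(1, attempts + 1):
        try:
            return send(command)
        except NotMasterError:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
    raise AssertionError("unreachable")


def make_flaky_sender(failures: int) -> Tuple[Callable[[str], dict], List[str]]:
    """Build a fake sender that raises NotMasterError `failures` times, then succeeds."""
    calls: List[str] = []

    def send(command: str) -> dict:
        calls.append(command)
        if len(calls) <= failures:
            raise NotMasterError(command)
        return {"ok": 1}

    return send, calls


if __name__ == "__main__":
    send, calls = make_flaky_sender(failures=2)
    print(call_recipient(send, "_recvChunkStart"), "after", len(calls), "attempts")
```

With this policy, a sender that fails twice with NotMaster still completes _recvChunkStart on the third attempt, while any other command propagates the first NotMaster error unchanged. Keeping the allow-list explicit makes it easy to see which commands have been judged safe to retry.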



 Comments   
Comment by Ratika Gandhi [ 29/Jul/19 ]

Low priority. Please reopen if this is required.

Generated at Thu Feb 08 04:28:36 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.