[SERVER-79771] Make Resharding Operation Resilient to NetworkInterfaceExceededTimeLimit Created: 05/Aug/23 Updated: 29/Oct/23 Resolved: 28/Aug/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 5.0.0, 6.0.0, 7.0.0, 7.1.0-rc0 |
| Fix Version/s: | 7.1.0-rc0, 6.0.10, 5.0.21, 7.0.2 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Abdul Qadeer | Assignee: | Abdul Qadeer |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-nyc-subteam1 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Assigned Teams: | Sharding NYC |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v7.0, v6.0, v5.0 |
| Sprint: | Sharding NYC 2023-09-04 |
| Participants: |
| Case: | (copied to CRM) |
| Description |
|
Pasting Max's findings:
The ReshardingRecipientService should also retry on transient errors in the NetworkTimeoutError category in every retry loop. Because the change will be made in resharding_future_util.h, this improvement should affect all code that uses resharding::WithAutomaticRetry. |
| Comments |
| Comment by Githook User [ 28/Aug/23 ] |
|
Author: {'name': 'Abdul Qadeer', 'email': 'abdul.qadeer@mongodb.com', 'username': 'zorro786'} Message: (cherry picked with edits from commit 3d7379f3f6f86e504d89d5cb0825fd20842ce27f) |
| Comment by Githook User [ 28/Aug/23 ] |
|
Author: {'name': 'Abdul Qadeer', 'email': 'abdul.qadeer@mongodb.com', 'username': 'zorro786'} Message: (cherry picked with edits from commit 3d7379f3f6f86e504d89d5cb0825fd20842ce27f) |
| Comment by Githook User [ 25/Aug/23 ] |
|
Author: {'name': 'Abdul Qadeer', 'email': 'abdul.qadeer@mongodb.com', 'username': 'zorro786'} Message: (cherry picked from commit 3d7379f3f6f86e504d89d5cb0825fd20842ce27f) |
| Comment by Githook User [ 25/Aug/23 ] |
|
Author: {'name': 'Abdul Qadeer', 'email': 'abdul.qadeer@mongodb.com', 'username': 'zorro786'}Message: |
| Comment by Max Hirschhorn [ 11/Aug/23 ] |
|
Based on an experiment Garaudy performed in Atlas (changing the instance size while resharding was running), it is confirmed that the ReshardingCollectionCloner can also receive a NetworkInterfaceExceededTimeLimit error while the recipient shard is waiting for a network connection to one of the donor shards. Therefore, in addition to retrying on this error directly in the ReshardingOplogFetcher (which has its own retry loop), the resharding::WithAutomaticRetry logic should also retry on the NetworkInterfaceExceededTimeLimit error. See also SERVER-72055. |
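
The behavior requested in this ticket can be illustrated with a minimal, standalone sketch. It is not the code in resharding_future_util.h and does not use the server's real types: the `ErrorCode` enum, the category helpers, `isTransient`, and `retryOnTransient` are all hypothetical names invented for illustration. The only thing taken from the ticket is the idea that a retry loop's "transient" check should cover the NetworkTimeoutError category, to which NetworkInterfaceExceededTimeLimit is assumed to belong.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for server error codes; NOT the real mongo::ErrorCodes enum.
enum class ErrorCode {
    OK,
    HostUnreachable,
    NetworkTimeout,
    NetworkInterfaceExceededTimeLimit,
    BadValue,
};

// Hypothetical category check: plain network errors (simplified to one code here).
bool isNetworkError(ErrorCode code) {
    return code == ErrorCode::HostUnreachable;
}

// Hypothetical category check: network timeout errors. The membership of
// NetworkInterfaceExceededTimeLimit in this category is an assumption based on the ticket text.
bool isNetworkTimeoutError(ErrorCode code) {
    return code == ErrorCode::NetworkTimeout ||
        code == ErrorCode::NetworkInterfaceExceededTimeLimit;
}

// The essence of the requested change: the timeout category also counts as transient.
bool isTransient(ErrorCode code) {
    return isNetworkError(code) || isNetworkTimeoutError(code);
}

// Runs `op` until it succeeds, fails with a non-transient error, or
// `maxAttempts` attempts have been made. Returns the last result.
ErrorCode retryOnTransient(const std::function<ErrorCode()>& op, int maxAttempts) {
    assert(maxAttempts > 0);
    ErrorCode last = ErrorCode::OK;
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        last = op();
        if (last == ErrorCode::OK || !isTransient(last)) {
            return last;  // Success or unrecoverable error: stop retrying.
        }
    }
    return last;  // Still transient after exhausting all attempts.
}
```

With a predicate like this shared by every caller, an operation that fails a few times with NetworkInterfaceExceededTimeLimit while waiting for a connection is simply attempted again, whereas a non-transient error such as `BadValue` is returned after the first attempt. The attempt cap is a simplification of this sketch; the ticket does not say how the real retry loops bound or back off their retries.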