[SERVER-79771] Make Resharding Operation Resilient to NetworkInterfaceExceededTimeLimit Created: 05/Aug/23  Updated: 29/Oct/23  Resolved: 28/Aug/23

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 5.0.0, 6.0.0, 7.0.0, 7.1.0-rc0
Fix Version/s: 7.1.0-rc0, 6.0.10, 5.0.21, 7.0.2

Type: Improvement Priority: Major - P3
Reporter: Abdul Qadeer Assignee: Abdul Qadeer
Resolution: Fixed Votes: 0
Labels: sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
related to SERVER-80020 The exhaustiveFindOnConfig() method s... Backlog
is related to SERVER-72055 NetworkInterfaceTL should always retu... In Code Review
is related to SERVER-58389 Capture NetworkInterfaceExceededTimeL... Closed
Assigned Teams:
Sharding NYC
Backwards Compatibility: Fully Compatible
Backport Requested:
v7.0, v6.0, v5.0
Sprint: Sharding NYC 2023-09-04
Participants:
Case:

 Description   

Pasting Max's findings:

The problematic area is in https://github.com/mongodb/mongo/blob/r5.0.19/src/mongo/db/s/resharding/resharding_oplog_fetcher.cpp#L202-L203. When the code was written, it was likely assumed that the function would not throw an exception because it returns a StatusWith<> result. It appears, however, that the function can also throw. The exception therefore propagates out of the function as an error, rather than being swallowed and retried via the return true.
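
To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not the server code: Status, StatusException, runAttempt, shouldRetry, and fetchWithRetry are hypothetical stand-ins, and the real fetcher is asynchronous. The point is only that a retry loop must fold both failure channels (the returned status and a thrown exception) into one value before deciding whether to retry.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-ins for the server's error types; not MongoDB's real API.
enum class ErrorCode { OK, NetworkInterfaceExceededTimeLimit, InternalError };

struct Status {
    Status() = default;
    Status(ErrorCode c, std::string r) : code(c), reason(std::move(r)) {}
    bool isOK() const { return code == ErrorCode::OK; }

    ErrorCode code = ErrorCode::OK;
    std::string reason;
};

// Exception that carries a Status, mimicking an error-coded exception.
struct StatusException : std::runtime_error {
    explicit StatusException(Status s)
        : std::runtime_error(s.reason), status(std::move(s)) {}
    Status status;
};

// Runs one attempt and folds both failure channels into a single Status:
// the value the callee returns and any StatusException it throws.
Status runAttempt(const std::function<Status()>& fetchOnce) {
    try {
        return fetchOnce();
    } catch (const StatusException& ex) {
        return ex.status;
    }
}

// Returns true when the caller should schedule another attempt.
bool shouldRetry(const Status& status) {
    return status.code == ErrorCode::NetworkInterfaceExceededTimeLimit;
}

// Retries until success, a non-retryable error, or maxAttempts is reached.
Status fetchWithRetry(const std::function<Status()>& fetchOnce, int maxAttempts) {
    assert(maxAttempts > 0);
    Status last;
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        last = runAttempt(fetchOnce);
        if (last.isOK() || !shouldRetry(last)) {
            break;
        }
    }
    return last;
}
```

Without the try/catch in runAttempt, a thrown timeout would bypass shouldRetry entirely and end the loop, which matches the behavior reported here.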

The ReshardingRecipientService should also retry on transient errors in the NetworkTimeoutError category in every retry loop. Since the change will be made in resharding_future_util.h, this improvement should affect all code using resharding::withAutomaticRetry.
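
As a rough sketch of what retrying by category (rather than by individual error code) could look like: the enum values, the isA helper, and the category memberships below are assumptions made for illustration and are not taken from the server's error code definitions or from resharding_future_util.h.

```cpp
#include <cassert>

// Hypothetical error codes and categories; the memberships are illustrative only.
enum class ErrorCode {
    OK,
    HostUnreachable,
    NetworkTimeout,
    NetworkInterfaceExceededTimeLimit,
    BadValue
};

enum class ErrorCategory { RetriableError, NetworkTimeoutError };

// Assumed category table: which codes belong to which category.
bool isA(ErrorCode code, ErrorCategory category) {
    switch (category) {
        case ErrorCategory::RetriableError:
            return code == ErrorCode::HostUnreachable ||
                code == ErrorCode::NetworkTimeout;
        case ErrorCategory::NetworkTimeoutError:
            return code == ErrorCode::NetworkTimeout ||
                code == ErrorCode::NetworkInterfaceExceededTimeLimit;
    }
    return false;
}

// Predicate before the change: only the RetriableError category is transient.
bool isTransientBefore(ErrorCode code) {
    return isA(code, ErrorCategory::RetriableError);
}

// Predicate after the change: NetworkTimeoutError category errors are transient too.
bool isTransientAfter(ErrorCode code) {
    return isA(code, ErrorCategory::RetriableError) ||
        isA(code, ErrorCategory::NetworkTimeoutError);
}
```

Placing the check in one shared predicate is what lets every retry loop built on the common helper pick up the new behavior without per-call-site changes.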



 Comments   
Comment by Githook User [ 28/Aug/23 ]

Author:

{'name': 'Abdul Qadeer', 'email': 'abdul.qadeer@mongodb.com', 'username': 'zorro786'}

Message: SERVER-79771 Add retry on NetworkTimeout category errors

(cherry picked with edits from commit 3d7379f3f6f86e504d89d5cb0825fd20842ce27f)
Branch: v5.0
https://github.com/mongodb/mongo/commit/5ae3be172db8654aa54abbfdf2dbf9fee25c4c2a

Comment by Githook User [ 28/Aug/23 ]

Author:

{'name': 'Abdul Qadeer', 'email': 'abdul.qadeer@mongodb.com', 'username': 'zorro786'}

Message: SERVER-79771 Add retry on NetworkTimeout category errors

(cherry picked with edits from commit 3d7379f3f6f86e504d89d5cb0825fd20842ce27f)
Branch: v6.0
https://github.com/mongodb/mongo/commit/35cd65218dcfcbfa2536770b9b07053beb0c2de1

Comment by Githook User [ 25/Aug/23 ]

Author:

{'name': 'Abdul Qadeer', 'email': 'abdul.qadeer@mongodb.com', 'username': 'zorro786'}

Message: SERVER-79771 Add retry on NetworkTimeout category errors

(cherry picked from commit 3d7379f3f6f86e504d89d5cb0825fd20842ce27f)
Branch: v7.0
https://github.com/mongodb/mongo/commit/52af74de1c41ccd47d7f41c7478ef14ecbe07cee

Comment by Githook User [ 25/Aug/23 ]

Author:

{'name': 'Abdul Qadeer', 'email': 'abdul.qadeer@mongodb.com', 'username': 'zorro786'}

Message: SERVER-79771 Add retry on NetworkTimeout category errors
Branch: master
https://github.com/mongodb/mongo/commit/3d7379f3f6f86e504d89d5cb0825fd20842ce27f

Comment by Max Hirschhorn [ 11/Aug/23 ]

Based on an experiment Garaudy performed in Atlas (changing the instance size while resharding was running), it is confirmed that the ReshardingCollectionCloner can also receive a NetworkInterfaceExceededTimeLimit error while the recipient shard is waiting for a network connection to one of the donor shards. Therefore, in addition to retrying on this error directly in the ReshardingOplogFetcher (which has its own retry loop), the resharding::WithAutomaticRetry logic should also retry on the NetworkInterfaceExceededTimeLimit error. See also SERVER-72055.
