  1. Core Server
  2. SERVER-79771

Make Resharding Operation Resilient to NetworkInterfaceExceededTimeLimit

    • Sharding NYC
    • Fully Compatible
    • v7.0, v6.0, v5.0
    • Sharding NYC 2023-09-04

      Pasting Max's findings:

      The problematic area is in https://github.com/mongodb/mongo/blob/r5.0.19/src/mongo/db/s/resharding/resharding_oplog_fetcher.cpp#L202-L203 where, at the time the code was written, it was likely assumed that a function returning a StatusWith<> result would not throw an exception. It seems the function can also throw. As a result, the exception propagates as an error instead of being swallowed and retried via the return true.

      The ReshardingRecipientService should also retry on transient errors in the NetworkTimeoutError category in any retry loop. Since the change will be made in resharding_future_util.h, this improvement should affect all code using resharding::withAutomaticRetry.
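
      The failure mode described above can be illustrated with a minimal, standalone sketch. It is not the MongoDB implementation: Status, CodedException, isTransient, runCatching and retryOnTransient are hypothetical stand-ins, and treating NetworkInterfaceExceededTimeLimit as the only transient code is an assumption made for the example. The point it shows is that a retry loop has to classify thrown errors the same way it classifies returned ones, otherwise a thrown timeout escapes the loop.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for illustration only; these are NOT MongoDB's
// Status, DBException, or ErrorCodes types.
enum class ErrorCode { OK, NetworkInterfaceExceededTimeLimit, BadValue };

struct Status {
    ErrorCode code;
    Status(ErrorCode c = ErrorCode::OK) : code(c) {}
    bool isOK() const { return code == ErrorCode::OK; }
};

struct CodedException : std::runtime_error {
    ErrorCode code;
    CodedException(ErrorCode c, const std::string& what)
        : std::runtime_error(what), code(c) {}
};

// Assumption: only the network timeout is considered transient here.
inline bool isTransient(ErrorCode c) {
    return c == ErrorCode::NetworkInterfaceExceededTimeLimit;
}

// Run the operation and fold a thrown CodedException into a Status, so the
// returned-error path and the thrown-error path share one retry decision.
inline Status runCatching(const std::function<Status()>& op) {
    try {
        return op();
    } catch (const CodedException& ex) {
        return Status(ex.code);
    }
}

// Run op at least once, then retry while the result is a transient error,
// for at most maxAttempts attempts in total. Non-transient errors and
// success are returned immediately.
inline Status retryOnTransient(const std::function<Status()>& op, int maxAttempts) {
    Status last = runCatching(op);
    for (int attempt = 1;
         attempt < maxAttempts && !last.isOK() && isTransient(last.code);
         ++attempt) {
        last = runCatching(op);
    }
    return last;
}
```

      In this sketch, an operation that throws a timeout twice and then succeeds is retried until it returns OK, while an operation that throws a non-transient error is attempted once and its error is returned to the caller.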

            abdul.qadeer@mongodb.com Abdul Qadeer