SERVER-61052 (Core Server)

Resharding Donor & Recipient's Coordinator Doc Updates Can Time Out Waiting for Replication on Coordinator Doc, Leading to Fatal Assertion


Details

    • Fully Compatible
    • v5.1, v5.0
    • Sharding 2021-11-01
    • 1

    Description

      Cause
      If the config server write that updates the coordinator document takes longer than the write concern's wtimeout (60000 ms in the log lines below), the update fails with WriteConcernFailed and the shard aborts on a fatal assertion.

      Context
      Both the ReshardingRecipientService and the ReshardingDonorService need to update the coordinator document on the config server with the latest state of the recipient and donor, respectively. Both retry errors that occur, but not every error is retryable. One non-retryable error is WriteConcernFailed, which is returned when waiting for replication times out.
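
      The retry behavior described above can be illustrated with a minimal, self-contained sketch. It is not the server's code: ErrorCode, Status, isRetryable and retryUntilDone are hypothetical stand-ins, and the set of retryable codes is an assumption chosen for illustration. The only point it shows is that a WriteConcernFailed result ends the loop on the first attempt instead of being retried.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-ins for illustration; not the server's real types.
enum class ErrorCode { OK, NetworkTimeout, NotWritablePrimary, WriteConcernFailed };

struct Status {
    ErrorCode code;
    std::string reason;
    bool isOK() const { return code == ErrorCode::OK; }
};

// Assumed classification: transient errors are retried,
// WriteConcernFailed is treated as non-retryable (as the ticket describes).
bool isRetryable(ErrorCode code) {
    return code == ErrorCode::NetworkTimeout ||
           code == ErrorCode::NotWritablePrimary;
}

// Runs op until it succeeds, fails with a non-retryable error,
// or maxAttempts (expected to be >= 1) is exhausted.
Status retryUntilDone(const std::function<Status()>& op, int maxAttempts) {
    Status last{ErrorCode::OK, ""};
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        last = op();
        if (last.isOK() || !isRetryable(last.code)) {
            return last;  // success, or an error the caller must handle
        }
    }
    return last;
}
```

      In this model, a write concern timeout surfaces to the caller immediately, which is how a slow majority write can turn into a fatal error even though the caller otherwise retries.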

      Problem
      The resharding components are designed to simply wait for the result of these updates, so a write that is slow to replicate should not raise an error.

      In the ReshardingDonorService, the updateCoordinatorDocument function uses the CatalogClient to update the coordinator document on the config server. Its call to updateConfigDocument should stop using `ShardingCatalogClient::kMajorityWriteConcern` and instead pass

      {w: "majority"}

      with no wtimeout specified.
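
      As a rough model of the proposed change, the sketch below contrasts a majority write concern that carries a wtimeout with one that does not. WriteConcernSketch, toDocString and waitForReplication are hypothetical names introduced here, not the server's WriteConcernOptions API; the 60000 ms figure in the usage note is taken from the log lines in this ticket.

```cpp
#include <cassert>
#include <chrono>
#include <optional>
#include <string>

// Hypothetical model of a write concern; not the server's real type.
struct WriteConcernSketch {
    std::string w;                                      // e.g. "majority"
    std::optional<std::chrono::milliseconds> wtimeout;  // unset = wait indefinitely
};

// Renders the write concern roughly as it would appear in a command.
std::string toDocString(const WriteConcernSketch& wc) {
    std::string out = "{w: \"" + wc.w + "\"";
    if (wc.wtimeout) {
        out += ", wtimeout: " + std::to_string(wc.wtimeout->count());
    }
    return out + "}";
}

// Models the wait for replication: returns false when replication takes
// longer than wtimeout (the WriteConcernFailed case); with no wtimeout
// the wait always completes eventually.
bool waitForReplication(const WriteConcernSketch& wc,
                        std::chrono::milliseconds replicationDelay) {
    return !wc.wtimeout || replicationDelay <= *wc.wtimeout;
}
```

      With a wtimeout of 60000 ms, a replication delay of 90 seconds fails in this model, while the same delay succeeds once the wtimeout is left unset.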

      In the RecipientStateMachineExternalStateImpl, the updateCoordinatorDocument function uses the CatalogClient to update the coordinator document on the config server. Its call to updateConfigDocument should stop using `ShardingCatalogClient::kMajorityWriteConcern` and instead pass

      {w: "majority"}

      with no wtimeout specified.

       

      Source
      You can see the following log lines in this Evergreen Patch

      756555:[j0:s1:n1] {"t":{"$date":"2021-10-27T18:51:32.274+00:00"},"s":"F",  "c":"RESHARD",  "id":5160600, "ctx":"ReshardingDonorService-1","msg":"Unrecoverable error occurred past the point donor was prepared to complete the resharding operation","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 60000, provenance: \"clientSupplied\" } }"}}
      756556:[j0:s1:n1] {"t":{"$date":"2021-10-27T18:51:32.274+00:00"},"s":"F",  "c":"ASSERT",   "id":23089,   "ctx":"ReshardingDonorService-1","msg":"Fatal assertion","attr":{"msgid":5160600,"file":"src/mongo/db/s/resharding/resharding_donor_service.cpp","line":459}}
      756557:[j0:s1:n1] {"t":{"$date":"2021-10-27T18:51:32.275+00:00"},"s":"F",  "c":"ASSERT",   "id":23090,   "ctx":"ReshardingDonorService-1","msg":"\n\n***aborting after fassert() failure\n\n"}
      

            People

              luis.osta@mongodb.com Luis Osta (Inactive)