  1. Core Server
  2. SERVER-58407

Resharding components do not retry on FailedToSatisfyReadPreference when targeting remote shard, leading to server crash

Details

    • Fully Compatible
    • v5.1, v5.0
    • Sharding 2021-11-01
    • 1

    Description

      There are multiple places where the ReshardingCoordinatorService, ReshardingRecipientService, and ReshardingDonorService attempt to target the primary of a replica set shard:

      Internally, these function calls go through RemoteCommandTargeterRS::findHost() and throw a FailedToSatisfyReadPreference exception after kDefaultFindHostTimeout (15 seconds) if no primary is available on the remote shard. The exception is caught and leads to an fassert() because, for example, it would be invalid for the participant shards to complete the resharding operation without performing a w:majority write on the config server primary.

      The resharding components should instead wait until a primary becomes available on the remote shard to avoid triggering this fassert().
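
      The proposed behavior can be sketched as a retry loop around the targeting call. This is only a minimal, standalone illustration and not the server's actual fix: the FailedToSatisfyReadPreference struct, waitForPrimary(), and the findPrimary, isCancelled, and backoff parameters are all hypothetical stand-ins, and the real code would presumably rely on the server's own cancellation and retry utilities.

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for the server's FailedToSatisfyReadPreference error.
struct FailedToSatisfyReadPreference : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Keeps calling findPrimary until it returns a host. Only
// FailedToSatisfyReadPreference is treated as retryable; any other exception
// propagates to the caller. isCancelled is an assumed hook that lets the
// caller stop waiting (for example on stepdown or shutdown), in which case
// the last FailedToSatisfyReadPreference is rethrown.
std::string waitForPrimary(const std::function<std::string()>& findPrimary,
                           const std::function<bool()>& isCancelled,
                           std::chrono::milliseconds backoff) {
    while (true) {
        try {
            return findPrimary();
        } catch (const FailedToSatisfyReadPreference&) {
            if (isCancelled()) {
                throw;
            }
            // No primary yet: pause briefly, then target the shard again.
            std::this_thread::sleep_for(backoff);
        }
    }
}
```

      With a loop of this shape, a temporarily missing primary delays the state transition instead of surfacing as an unrecoverable error, and the cancellation hook keeps the wait from outliving the node's own primary term.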

      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.440+00:00"},"s":"I",  "c":"RESHARD",  "id":5279506, "ctx":"ReshardingRecipientService-4","msg":"Transitioned resharding recipient state","attr":{"newState":"applying","oldState":"cloning","namespace":"test1_fsmdb0.fsmcoll0","collectionUUID":{"uuid":{"$uuid":"04cb2914-75ec-4b6c-a4df-f416c22459c7"}},"reshardingUUID":{"uuid":{"$uuid":"245e08b2-5f20-4530-a87e-1acd0faa2db4"}}}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.465+00:00"},"s":"I",  "c":"-",        "id":4333227, "ctx":"ReshardingRecipientService-7","msg":"RSM monitoring host in expedited mode until we detect a primary","attr":{"host":"localhost:20002","replicaSet":"config-rs"}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.465+00:00"},"s":"I",  "c":"-",        "id":4333227, "ctx":"ReshardingRecipientService-7","msg":"RSM monitoring host in expedited mode until we detect a primary","attr":{"host":"localhost:20000","replicaSet":"config-rs"}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.465+00:00"},"s":"I",  "c":"-",        "id":4333218, "ctx":"ReshardingRecipientService-7","msg":"Rescheduling the next replica set monitoring request","attr":{"replicaSet":"config-rs","host":"localhost:20000","delayMillis":0}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.465+00:00"},"s":"I",  "c":"-",        "id":4333227, "ctx":"ReshardingRecipientService-7","msg":"RSM monitoring host in expedited mode until we detect a primary","attr":{"host":"localhost:20001","replicaSet":"config-rs"}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:48.465+00:00"},"s":"I",  "c":"RESHARD",  "id":4956500, "ctx":"ReshardingRecipientService-7","msg":"Resharding operation recipient state machine failed","attr":{"namespace":"test1_fsmdb0.fsmcoll0","reshardingUUID":{"uuid":{"$uuid":"245e08b2-5f20-4530-a87e-1acd0faa2db4"}},"error":"FailedToSatisfyReadPreference: Could not find host matching read preference { mode: \"primary\" } for set config-rs"}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:48.466+00:00"},"s":"I",  "c":"RESHARD",  "id":5279506, "ctx":"ReshardingRecipientService-7","msg":"Transitioned resharding recipient state","attr":{"newState":"error","oldState":"applying","namespace":"test1_fsmdb0.fsmcoll0","collectionUUID":{"uuid":{"$uuid":"04cb2914-75ec-4b6c-a4df-f416c22459c7"}},"reshardingUUID":{"uuid":{"$uuid":"245e08b2-5f20-4530-a87e-1acd0faa2db4"}}}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:18:03.470+00:00"},"s":"F",  "c":"RESHARD",  "id":5551101, "ctx":"ReshardingRecipientService-5","msg":"Unrecoverable error occurred past the point recipient was prepared to complete the resharding operation","attr":{"error":"FailedToSatisfyReadPreference: Could not find host matching read preference { mode: \"primary\" } for set config-rs"}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:18:03.470+00:00"},"s":"F",  "c":"ASSERT",   "id":23089,   "ctx":"ReshardingRecipientService-5","msg":"Fatal assertion","attr":{"msgid":5551101,"file":"src/mongo/db/s/resharding/resharding_recipient_service.cpp","line":412}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:18:03.470+00:00"},"s":"F",  "c":"ASSERT",   "id":23090,   "ctx":"ReshardingRecipientService-5","msg":"\n\n***aborting after fassert() failure\n\n"}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:18:03.470+00:00"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"ReshardingRecipientService-5","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
      

      https://logkeeper.mongodb.org/lobster/build/e6cde008d39df341d7eb3b22eff2f45b/test/616ec55854f248304d3ff5b3#bookmarks=0%2C10033%2C10034%2C10035%2C10036%2C10038%2C10039%2C10040%2C10041%2C10054%2C10088%2C593975&f~=100~%28%5C%5Bj0%3As0%3An1%5C%5D%7C%5C%5Bj0%3As1%3An2%5C%5D%29&l=1&shareLine=10033

          People

            max.hirschhorn@mongodb.com Max Hirschhorn
            blake.oler@mongodb.com Blake Oler