Core Server / SERVER-58407

Resharding components do not retry on FailedToSatisfyReadPreference when targeting remote shard, leading to server crash

Details

    • Fully Compatible
    • v5.1, v5.0
    • Sharding 2021-11-01
    • 1

    Description

      There are multiple places where the ReshardingCoordinatorService, ReshardingRecipientService, and ReshardingDonorService attempt to target the primary of a replica set shard:

      Internally, these function calls go through RemoteCommandTargeterRS::findHost() and throw FailedToSatisfyReadPreference after kDefaultFindHostTimeout (15 seconds) if no primary is available on the remote shard. This exception is caught and leads to an fassert() because, for example, it would be invalid for the participant shards to complete the resharding operation without performing a w:majority write on the config server primary.

      The resharding components should instead wait until a primary becomes available on the remote shard to avoid triggering this fassert().

      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.440+00:00"},"s":"I",  "c":"RESHARD",  "id":5279506, "ctx":"ReshardingRecipientService-4","msg":"Transitioned resharding recipient state","attr":{"newState":"applying","oldState":"cloning","namespace":"test1_fsmdb0.fsmcoll0","collectionUUID":{"uuid":{"$uuid":"04cb2914-75ec-4b6c-a4df-f416c22459c7"}},"reshardingUUID":{"uuid":{"$uuid":"245e08b2-5f20-4530-a87e-1acd0faa2db4"}}}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.465+00:00"},"s":"I",  "c":"-",        "id":4333227, "ctx":"ReshardingRecipientService-7","msg":"RSM monitoring host in expedited mode until we detect a primary","attr":{"host":"localhost:20002","replicaSet":"config-rs"}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.465+00:00"},"s":"I",  "c":"-",        "id":4333227, "ctx":"ReshardingRecipientService-7","msg":"RSM monitoring host in expedited mode until we detect a primary","attr":{"host":"localhost:20000","replicaSet":"config-rs"}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.465+00:00"},"s":"I",  "c":"-",        "id":4333218, "ctx":"ReshardingRecipientService-7","msg":"Rescheduling the next replica set monitoring request","attr":{"replicaSet":"config-rs","host":"localhost:20000","delayMillis":0}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.465+00:00"},"s":"I",  "c":"-",        "id":4333227, "ctx":"ReshardingRecipientService-7","msg":"RSM monitoring host in expedited mode until we detect a primary","attr":{"host":"localhost:20001","replicaSet":"config-rs"}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:48.465+00:00"},"s":"I",  "c":"RESHARD",  "id":4956500, "ctx":"ReshardingRecipientService-7","msg":"Resharding operation recipient state machine failed","attr":{"namespace":"test1_fsmdb0.fsmcoll0","reshardingUUID":{"uuid":{"$uuid":"245e08b2-5f20-4530-a87e-1acd0faa2db4"}},"error":"FailedToSatisfyReadPreference: Could not find host matching read preference { mode: \"primary\" } for set config-rs"}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:48.466+00:00"},"s":"I",  "c":"RESHARD",  "id":5279506, "ctx":"ReshardingRecipientService-7","msg":"Transitioned resharding recipient state","attr":{"newState":"error","oldState":"applying","namespace":"test1_fsmdb0.fsmcoll0","collectionUUID":{"uuid":{"$uuid":"04cb2914-75ec-4b6c-a4df-f416c22459c7"}},"reshardingUUID":{"uuid":{"$uuid":"245e08b2-5f20-4530-a87e-1acd0faa2db4"}}}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:18:03.470+00:00"},"s":"F",  "c":"RESHARD",  "id":5551101, "ctx":"ReshardingRecipientService-5","msg":"Unrecoverable error occurred past the point recipient was prepared to complete the resharding operation","attr":{"error":"FailedToSatisfyReadPreference: Could not find host matching read preference { mode: \"primary\" } for set config-rs"}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:18:03.470+00:00"},"s":"F",  "c":"ASSERT",   "id":23089,   "ctx":"ReshardingRecipientService-5","msg":"Fatal assertion","attr":{"msgid":5551101,"file":"src/mongo/db/s/resharding/resharding_recipient_service.cpp","line":412}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:18:03.470+00:00"},"s":"F",  "c":"ASSERT",   "id":23090,   "ctx":"ReshardingRecipientService-5","msg":"\n\n***aborting after fassert() failure\n\n"}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:18:03.470+00:00"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"ReshardingRecipientService-5","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
      

      https://logkeeper.mongodb.org/lobster/build/e6cde008d39df341d7eb3b22eff2f45b/test/616ec55854f248304d3ff5b3#bookmarks=0%2C10033%2C10034%2C10035%2C10036%2C10038%2C10039%2C10040%2C10041%2C10054%2C10088%2C593975&f~=100~%28%5C%5Bj0%3As0%3An1%5C%5D%7C%5C%5Bj0%3As1%3An2%5C%5D%29&l=1&shareLine=10033
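
The behavior proposed in the description (keep waiting for a primary rather than letting FailedToSatisfyReadPreference reach the fassert() path) could be sketched roughly as below. This is a standalone illustration, not the server's code: the exception type is a hypothetical stand-in for the real error code, and findPrimaryWithRetry, findPrimary, isCancelled, and backoff are names invented for this example. The real services would presumably tie cancellation to stepdown or shutdown.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for the server's FailedToSatisfyReadPreference error.
struct FailedToSatisfyReadPreference : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Calls findPrimary() until it returns a host. Only
// FailedToSatisfyReadPreference is treated as retryable; any other exception
// propagates to the caller unchanged. isCancelled() is an assumed hook that
// stands in for whatever stepdown/shutdown signal the real services observe.
std::string findPrimaryWithRetry(const std::function<std::string()>& findPrimary,
                                 const std::function<bool()>& isCancelled,
                                 std::chrono::milliseconds backoff) {
    assert(backoff.count() >= 0);
    for (;;) {
        try {
            return findPrimary();
        } catch (const FailedToSatisfyReadPreference&) {
            if (isCancelled()) {
                throw;  // Give up only when the operation itself is being torn down.
            }
            // No primary yet: wait briefly, then target the shard again.
            std::this_thread::sleep_for(backoff);
        }
    }
}
```

A caller would pass a callable that wraps the targeting attempt and a cancellation check. With this shape, a primary that stays unavailable past one 15-second find-host timeout triggers another attempt instead of an error state transition.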

            People

              max.hirschhorn@mongodb.com Max Hirschhorn
              blake.oler@mongodb.com Blake Oler