  1. Core Server
  2. SERVER-58407

Resharding components do not retry on FailedToSatisfyReadPreference when targeting remote shard, leading to server crash

    • Fully Compatible
    • v5.1, v5.0
    • Sharding 2021-11-01
    • 1

      There are multiple places where the ReshardingCoordinatorService, ReshardingRecipientService, and ReshardingDonorService attempt to target the primary of a replica set shard:

      Internally, these function calls go through RemoteCommandTargeterRS::findHost(), which throws a FailedToSatisfyReadPreference exception after kDefaultFindHostTimeout (15 seconds) if no primary is available on the remote shard. The exception is caught and leads to an fassert() because, for example, it would be invalid for the participant shards to complete the resharding operation without performing a w:majority write on the config server primary.

      The resharding components should instead wait until a primary becomes available on the remote shard to avoid triggering this fassert().
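
      The proposed behavior can be illustrated with a minimal, standalone sketch. It is not the server's implementation: the exception type, the findPrimaryWithRetry() helper, the cancellation callback, and the backoff parameters are all hypothetical stand-ins, and the real fix would presumably be built on the server's own retry and cancellation machinery. The idea shown is only that a FailedToSatisfyReadPreference error is treated as transient and retried until a primary is found or the operation is cancelled, while any other error still propagates.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for the error raised when no host matches the
// requested read preference before the find-host timeout expires.
struct FailedToSatisfyReadPreference : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical cancellation hook, e.g. returning true on stepdown or shutdown.
using IsCancelled = std::function<bool()>;

// Calls findPrimary() until it returns a host. Only
// FailedToSatisfyReadPreference is retried; every other exception propagates
// to the caller unchanged. If isCancelled() reports true after a failed
// attempt, the last FailedToSatisfyReadPreference is rethrown.
template <typename FindPrimary>
std::string findPrimaryWithRetry(FindPrimary findPrimary,
                                 const IsCancelled& isCancelled,
                                 std::chrono::milliseconds initialBackoff,
                                 std::chrono::milliseconds maxBackoff) {
    auto backoff = initialBackoff;
    while (true) {
        try {
            return findPrimary();
        } catch (const FailedToSatisfyReadPreference&) {
            if (isCancelled()) {
                throw;
            }
            // Wait before the next attempt, doubling the delay up to a cap.
            std::this_thread::sleep_for(backoff);
            backoff = std::min(backoff * 2, maxBackoff);
        }
    }
}
```

      In this sketch a caller would pass a callable that wraps the targeting call and a predicate that reports whether the resharding operation should stop waiting. Rethrowing on cancellation keeps the loop from blocking stepdown or shutdown indefinitely.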

      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.440+00:00"},"s":"I",  "c":"RESHARD",  "id":5279506, "ctx":"ReshardingRecipientService-4","msg":"Transitioned resharding recipient state","attr":{"newState":"applying","oldState":"cloning","namespace":"test1_fsmdb0.fsmcoll0","collectionUUID":{"uuid":{"$uuid":"04cb2914-75ec-4b6c-a4df-f416c22459c7"}},"reshardingUUID":{"uuid":{"$uuid":"245e08b2-5f20-4530-a87e-1acd0faa2db4"}}}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.465+00:00"},"s":"I",  "c":"-",        "id":4333227, "ctx":"ReshardingRecipientService-7","msg":"RSM monitoring host in expedited mode until we detect a primary","attr":{"host":"localhost:20002","replicaSet":"config-rs"}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.465+00:00"},"s":"I",  "c":"-",        "id":4333227, "ctx":"ReshardingRecipientService-7","msg":"RSM monitoring host in expedited mode until we detect a primary","attr":{"host":"localhost:20000","replicaSet":"config-rs"}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.465+00:00"},"s":"I",  "c":"-",        "id":4333218, "ctx":"ReshardingRecipientService-7","msg":"Rescheduling the next replica set monitoring request","attr":{"replicaSet":"config-rs","host":"localhost:20000","delayMillis":0}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.465+00:00"},"s":"I",  "c":"-",        "id":4333227, "ctx":"ReshardingRecipientService-7","msg":"RSM monitoring host in expedited mode until we detect a primary","attr":{"host":"localhost:20001","replicaSet":"config-rs"}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:48.465+00:00"},"s":"I",  "c":"RESHARD",  "id":4956500, "ctx":"ReshardingRecipientService-7","msg":"Resharding operation recipient state machine failed","attr":{"namespace":"test1_fsmdb0.fsmcoll0","reshardingUUID":{"uuid":{"$uuid":"245e08b2-5f20-4530-a87e-1acd0faa2db4"}},"error":"FailedToSatisfyReadPreference: Could not find host matching read preference { mode: \"primary\" } for set config-rs"}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:48.466+00:00"},"s":"I",  "c":"RESHARD",  "id":5279506, "ctx":"ReshardingRecipientService-7","msg":"Transitioned resharding recipient state","attr":{"newState":"error","oldState":"applying","namespace":"test1_fsmdb0.fsmcoll0","collectionUUID":{"uuid":{"$uuid":"04cb2914-75ec-4b6c-a4df-f416c22459c7"}},"reshardingUUID":{"uuid":{"$uuid":"245e08b2-5f20-4530-a87e-1acd0faa2db4"}}}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:18:03.470+00:00"},"s":"F",  "c":"RESHARD",  "id":5551101, "ctx":"ReshardingRecipientService-5","msg":"Unrecoverable error occurred past the point recipient was prepared to complete the resharding operation","attr":{"error":"FailedToSatisfyReadPreference: Could not find host matching read preference { mode: \"primary\" } for set config-rs"}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:18:03.470+00:00"},"s":"F",  "c":"ASSERT",   "id":23089,   "ctx":"ReshardingRecipientService-5","msg":"Fatal assertion","attr":{"msgid":5551101,"file":"src/mongo/db/s/resharding/resharding_recipient_service.cpp","line":412}}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:18:03.470+00:00"},"s":"F",  "c":"ASSERT",   "id":23090,   "ctx":"ReshardingRecipientService-5","msg":"\n\n***aborting after fassert() failure\n\n"}
      [j0:s0:n1] {"t":{"$date":"2021-10-19T13:18:03.470+00:00"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"ReshardingRecipientService-5","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
      

      https://logkeeper.mongodb.org/lobster/build/e6cde008d39df341d7eb3b22eff2f45b/test/616ec55854f248304d3ff5b3#bookmarks=0%2C10033%2C10034%2C10035%2C10036%2C10038%2C10039%2C10040%2C10041%2C10054%2C10088%2C593975&f~=100~%28%5C%5Bj0%3As0%3An1%5C%5D%7C%5C%5Bj0%3As1%3An2%5C%5D%29&l=1&shareLine=10033

            Assignee:
            max.hirschhorn@mongodb.com Max Hirschhorn
            Reporter:
            blake.oler@mongodb.com Blake Oler
