Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-47359

ShardRegistry reload can race with RSM updates to ShardRegistry

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4.0-rc2, 4.7.0
    • Component/s: Sharding
    • Labels:
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.4
    • Sprint:
      Sharding 2020-04-20
    • Linked BF Score:
      115

      Description

      During startup, the ShardRegistry stores all nodes in the initial config server string and then schedules its initial reload and the RSM begins to monitor the config server. During start up, if the RSM gets a topology change of type ReplicaSetNoPrimary, (meaning its only heard back from secondaries), it will update the ShardRegistry with this info and remove any nodes that are not yet confirmed. The following set of steps can lead to a ShardNotFound error:

      1. ShardRegistry starts up with all config server nodes passed in the initial connection string.
      2. RSM starts up and begins monitoring the config server.
      3. RSM hears back from a secondary node (let's say node 1), the TopologyChangeEvent can look something like [node 0: type unknown, node 1: type secondary, node 2: type unknown]. The RSM notifies the ShardRegistry with this info. This will cause the ShardRegistry to remove node 0 and node 2 from the ShardRegistryData object `_data` owned by the ShardRegistry. (Specifically it removes hosts from the ShardRegistryData's `_hostLookup` and `_lookup` maps here)
      4. ShardRegistry's initial refresh starts. It constructs a new ShardRegsitryData object which at this moment only has node 1 for the config server.
      5. RSM gets a TopologyChangeEvent from the primary which can look something like [node 0: type primary, node 1: type secondary, node 2: type secondary]. The RSM now thinks each of these nodes is able to be targeted. It sends an update to the ShardRegistry with this info.
      6. The ShardRegistry adds each of the nodes the RSM notified it about to the `_data` object.
      7. The ShardRegistry reload swaps out it's `_data` object that was just updated in step 6 with the ShardRegistryData object in constructed in step 4.
      8. The RSM thinks node 0 is able to be targeted, so returns it to some command to target. The ShardingNetworkConnectionHook then tries to target node 0 and grab it from the ShardRegistry, but it does not exist anymore so throws ShardNotFound.

        Attachments

          Activity

            People

            Assignee:
            janna.golden Janna Golden
            Reporter:
            janna.golden Janna Golden
            Participants:
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved: