Race during startup causes extremely rare failures in shard_identity_config_update.js


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability
    • ALL
    • 0

      During startup, two threads can race to call ShardRegistry::updateReplSetHosts. The first racing thread is the initAndListen thread, which has the stale CSRS connection string loaded from disk and calls updateReplSetHosts as part of startup.

      The second racing thread (or maybe it's an executor) is the ReplicaSetMonitor. It is started from the initAndListen thread and calls updateReplSetHosts through the ShardingReplicaSetChangeListener once it detects the topology change.

      This race can go three different ways, two of which result in the test passing:

      1. ReplicaSetMonitor goes last. This is the simple case. RSM has the up-to-date result and replaces the stale one written from the initAndListen thread. The test passes.
      2. ReplicaSetMonitor goes first and fully detects the topology change before the call to getServerAddress. In this case, initAndListen reads the updated string from the RSM and writes that. The test passes.
      3. ReplicaSetMonitor goes between the call to getServerAddress and the call to updateReplSetHosts. In this case, rsMonitorConfigConnStr holds the stale config string, RSM writes the up-to-date one, and then the initAndListen thread overwrites it with the stale one. This was confirmed by adding std::this_thread::sleep_for(std::chrono::seconds(10)); right before the call to updateReplSetHosts, which makes the test fail consistently.

      The window in case 3 is very narrow, which explains why this failure is so rare.
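
      The three interleavings above can be replayed deterministically with a small model. This is only a sketch: FakeMonitor, FakeRegistry, Interleaving, runStartup, and the two connection strings are hypothetical stand-ins, not the real server classes or values. It models only the order of the read (getServerAddress) and the two writes (updateReplSetHosts).

```cpp
#include <string>

// Hypothetical stand-in for the ReplicaSetMonitor: holds the address
// that getServerAddress would return.
struct FakeMonitor {
    std::string serverAddress;
};

// Hypothetical stand-in for the ShardRegistry: remembers the last
// string written through updateReplSetHosts.
struct FakeRegistry {
    std::string configConnStr;
    void updateReplSetHosts(const std::string& connStr) { configConnStr = connStr; }
};

// When the monitor's topology-change callback runs relative to initAndListen.
enum class Interleaving {
    MonitorLast,                 // case 1
    MonitorBeforeRead,           // case 2
    MonitorBetweenReadAndWrite   // case 3
};

// Replays one interleaving and returns what the registry ends up holding.
std::string runStartup(Interleaving order) {
    const std::string stale = "csrs/host1:27019";              // made-up value
    const std::string fresh = "csrs/host1:27019,host2:27019";  // made-up value

    FakeMonitor monitor{stale};  // seeded from the stale on-disk string
    FakeRegistry registry;

    // The monitor detects the topology change and notifies the registry.
    auto monitorDetectsChange = [&] {
        monitor.serverAddress = fresh;
        registry.updateReplSetHosts(fresh);
    };

    if (order == Interleaving::MonitorBeforeRead) monitorDetectsChange();

    // initAndListen: read the connection string from the monitor.
    const std::string rsMonitorConfigConnStr = monitor.serverAddress;

    if (order == Interleaving::MonitorBetweenReadAndWrite) monitorDetectsChange();

    // initAndListen: write whatever was read, even if it is stale by now.
    registry.updateReplSetHosts(rsMonitorConfigConnStr);

    if (order == Interleaving::MonitorLast) monitorDetectsChange();

    return registry.configConnStr;
}
```

      In this model only MonitorBetweenReadAndWrite leaves the stale string in the registry, which matches case 3. The sleep_for experiment widens exactly that gap between the read and the write.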

            Assignee:
            Janna Golden
            Reporter:
            Ryan Berryhill
