  1. Core Server
  2. SERVER-47169

Sharding initialization contacts config shard before ShardRegistry updated by RSM, preventing mongos from starting up

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 4.4.0-rc0, 4.7.0
    • Affects Version/s: None
    • Component/s: Sharding
    • None
    • Fully Compatible
    • ALL
    • v4.4

      Apply the patch below to have mongos delay updating the ShardRegistry after the listener is notified about a confirmed replica set, then run the new test with:

      python buildscripts/resmoke.py --suite=sharding repro_mongos_fails_during_sharding_initialization.js
      
      diff --git a/repro_mongos_fails_during_sharding_initialization.js b/repro_mongos_fails_during_sharding_initialization.js
      new file mode 100644
      index 0000000000..4c2429b244
      --- /dev/null
      +++ b/repro_mongos_fails_during_sharding_initialization.js
      @@ -0,0 +1,14 @@
      +(function() {
      +'use strict';
      +
      +var st = new ShardingTest({config: 3, mongos: 1, shards: 0});
      +
      +// XXX: Even with the artificial delay in ShardingReplicaSetChangeListener::onConfirmedSet(), the
      +// issue only seems to manifest in mongos about half the time. We restart the mongos process a few
      +// times to make the failure more apparent.
      +for (let i = 0; i < 10; ++i) {
      +    st.restartMongos(0);
      +}
      +
      +st.stop();
      +}());
      diff --git a/src/mongo/s/server.cpp b/src/mongo/s/server.cpp
      index 34bb98f4d9..39ff0c436f 100644
      --- a/src/mongo/s/server.cpp
      +++ b/src/mongo/s/server.cpp
      @@ -491,6 +491,12 @@ public:
                   invariant(args.status);
      
                   try {
      +                LOGV2(2284600,
      +                      "Sleeping before updating sharding state with confirmed set {connStr}",
      +                      "connStr"_attr = connStr);
      +
      +                sleepmillis(10000);
      +
                       LOGV2(22846,
                             "Updating sharding state with confirmed set {connStr}",
                             "connStr"_attr = connStr);
      
    • Sharding 2020-04-06
    • 27

      The ShardingNetworkConnectionHook causes a ShardNotFound error status to be returned if the HostAndPort isn't found in the ShardRegistry. This hook is run after a connection to the remote host has been established.

      Status ShardingNetworkConnectionHook::validateHostImpl(
          const HostAndPort& remoteHost, const executor::RemoteCommandResponse& isMasterReply) {
          auto shard =
              Grid::get(getGlobalServiceContext())->shardRegistry()->getShardForHostNoReload(remoteHost);
          if (!shard) {
              return {ErrorCodes::ShardNotFound,
                      str::stream() << "No shard found for host: " << remoteHost.toString()};
          }
      
          ...
      }
      

      The connection string for the config shard may be updated while the sharding subsystem is initializing. (For reasons I still don't quite understand, this doesn't happen every time mongos is started, but I believe it is a necessary condition for the issue reported here to manifest.) Updating the connection string upon receiving isMaster responses from secondaries of the config shard (where the primary is still seen by the RSM as "Unknown") would remove the HostAndPort for the primary from ShardRegistry::_hostLookup. The HostAndPort for the primary is re-added to ShardRegistry::_hostLookup as part of ShardingReplicaSetChangeListener::onConfirmedSet(), which schedules a task on the fixed executor. Because the ShardRegistry::_hostLookup map isn't updated synchronously, the RSM can view the now-confirmed primary as available for targeting primary-only reads while the validation hook that runs after the connection is established still fails. This leaves mongos unable to start up successfully.
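
      The ordering problem described above can be illustrated with a small, self-contained C++ model. This is only a sketch and not MongoDB code: HostLookup, DeferredExecutor, validateHost, simulateStartup, and the host names are all hypothetical stand-ins, and the real ShardRegistry and executor are considerably more involved. It assumes the primary is dropped from the lookup first and that its re-add is queued rather than applied immediately.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for ShardRegistry::_hostLookup: the hosts that
// currently map to a known shard.
struct HostLookup {
    std::set<std::string> hosts;
    bool contains(const std::string& host) const { return hosts.count(host) > 0; }
};

// Hypothetical stand-in for the fixed executor: scheduled tasks do not run
// until drain() is called, which models the asynchronous update.
struct DeferredExecutor {
    std::deque<std::function<void()>> tasks;
    void schedule(std::function<void()> task) { tasks.push_back(std::move(task)); }
    void drain() {
        while (!tasks.empty()) {
            auto task = std::move(tasks.front());
            tasks.pop_front();
            task();
        }
    }
};

// Simplified model of the post-connection hook: the host must be known.
bool validateHost(const HostLookup& lookup, const std::string& host) {
    return lookup.contains(host);
}

struct Outcome {
    bool beforeUpdate;  // hook result while the re-add task is still queued
    bool afterUpdate;   // hook result once the executor has run the task
};

Outcome simulateStartup() {
    const std::string primary = "cfg1:27019";  // illustrative host names
    HostLookup lookup{{primary, "cfg2:27019", "cfg3:27019"}};
    DeferredExecutor executor;
    assert(validateHost(lookup, primary));  // initially the primary is known

    // 1. A connection string built only from secondaries' responses is
    //    applied, which drops the primary from the lookup.
    lookup.hosts.erase(primary);

    // 2. The set is confirmed; re-adding the primary is scheduled, not applied.
    executor.schedule([&lookup, primary] { lookup.hosts.insert(primary); });

    // 3. The monitor already treats the primary as targetable, so a connection
    //    is made and the hook runs before the scheduled task does.
    const bool before = validateHost(lookup, primary);

    // 4. Only now does the executor run the queued update.
    executor.drain();
    const bool after = validateHost(lookup, primary);

    return {before, after};
}
```

      In this model, simulateStartup() reports a failed validation before the queued task runs and a successful one afterwards, which mirrors the window in which mongos would see ShardNotFound for a primary that the RSM already considers usable.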

            Assignee:
            haley.connelly@mongodb.com Haley Connelly
            Reporter:
            max.hirschhorn@mongodb.com Max Hirschhorn
            Votes:
            0
            Watchers:
            5
