cluster-level setClusterParameter can run indefinitely if one shard fails

XMLWordPrintableJSON

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing
    • ALL
    • 🟩 Routing and Topology
    • None
    • None
    • None
    • None
    • None
    • None

      Running the jstest code below, setClusterParameter will just keep trying to set the parameter on a shard will never succeed because of a failpoint. The only log message you get at default verbosity is a deeply-internal one that doesn't help understand the situation and in fact could mislead you to thinking the network is bad ("ctx":"NetworkInterfaceTL-ConfigsvrCoordinatorServiceNetwork","msg":"Ending connection due to bad connection status","attr":
      {"hostAndPort":"ip-10-122-13-216:20045","error":"CallbackCanceled: Baton wait canceled"). Maybe in practice, shards always execute this command successfully, but we've seen a (rather unique) case where this can cause a lot of confusion. Here is the jstest code, with increased log verbosity for some relevant components that makes it clearer what's going on:

      import {ShardingTest} from "jstests/libs/shardingtest.js";
      import {configureFailPoint} from "jstests/libs/fail_point_util.js";
      
      const st = new ShardingTest({shards: 3});
      
      const csrsPrimary = st.configRS.getPrimary();
      csrsPrimary.adminCommand({
          setParameter: 1,
          logComponentVerbosity: {
              network: { verbosity: 2, asio: { verbosity: 5 } }
          },
      });
      
      const shard0Primary = st.rs0.getPrimary();
      
      jsTest.log('enabling failCommand on shard0 primary (port ' + shard0Primary.port + ')');
      
      let fp = configureFailPoint(shard0Primary, "failCommand", {
          errorCode: ErrorCodes.InternalError,
          failCommands: ["_shardsvrSetClusterParameter"],
          failInternalCommands: true,
      });
      
      jsTest.log('Running setClusterParameter');
      assert.commandWorked(
          st.s.adminCommand({setClusterParameter: {defaultMaxTimeMS: {readOperations: 120000}}}));
      
      st.stop();
      

      The test will hang during setClusterParameter.

            Assignee:
            Unassigned
            Reporter:
            Ryan Berryhill
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated: