  1. Core Server
  2. SERVER-23192

mongos and shards become unusable after more than 30 consecutive failed attempts to contact all CSRS config server nodes

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 3.3.11
    • Affects Version/s: 3.2.4
    • Component/s: Sharding
    • Labels:
      None
    • Minor Change
    • ALL
    • See the comment in jstests/sharding/startup_with_all_configs_down.js.
    • Sharding 16 (06/24/16), Sharding 18 (08/05/16)
    • 0

      Issue Status as of Oct 07, 2016

      ISSUE DESCRIPTION AND IMPACT
      If mongos loses network contact with all nodes of the CSRS config server replica set (both primary and secondaries), the replica set monitor deems the set 'unusable' and stops monitoring it. As a result, all operations that need to access config server metadata fail.
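
The behavior described above can be pictured with a small model. This is only a sketch, not the server's ReplicaSetMonitor code: the name `SetMonitorModel`, its fields, the assumption that a successful check resets the failure count, and the "more than the threshold" comparison (taken from the ticket title) are all illustrative assumptions.

```javascript
// Illustrative model only; NOT MongoDB's ReplicaSetMonitor implementation.
// SetMonitorModel and its fields are hypothetical names.
function SetMonitorModel(maxFailedChecks) {
    this.maxFailedChecks = maxFailedChecks;  // threshold, e.g. 30
    this.consecutiveFailures = 0;
    this.monitoring = true;
}

// Record the outcome of one check of the whole set.
// anyHostReachable: true if at least one node of the set responded.
SetMonitorModel.prototype.recordCheck = function (anyHostReachable) {
    if (!this.monitoring) {
        return;  // once monitoring has stopped, nothing in this model revives it
    }
    if (anyHostReachable) {
        this.consecutiveFailures = 0;  // assumption: a success resets the count
        return;
    }
    this.consecutiveFailures += 1;
    // Assumption: the set is given up on after MORE than the threshold
    // of consecutive failures, following the wording of the ticket title.
    if (this.consecutiveFailures > this.maxFailedChecks) {
        this.monitoring = false;
    }
};
```

In this model, a later successful contact has no effect once `monitoring` is false, which mirrors why the affected process keeps failing even after the config servers become reachable again.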

      DIAGNOSIS AND AFFECTED VERSIONS
      This issue is present on MongoDB 3.2.0 to 3.2.9.

      Operations that require access to config server metadata will begin failing with the following error:

      > db.foo.find().itcount();
      
      2016-03-16T16:59:20.941-0400 E QUERY    [thread1] Error: error: {
              "code" : 71,
              "ok" : 0,
              "errmsg" : "None of the hosts for replica set test-configRS could be contacted."
      } :
      _getErrorWithCode@src/mongo/shell/utils.js:25:13
      DBCommandCursor@src/mongo/shell/query.js:694:1
      DBQuery.prototype._exec@src/mongo/shell/query.js:118:28
      DBQuery.prototype.hasNext@src/mongo/shell/query.js:281:5
      DBQuery.prototype.itcount@src/mongo/shell/query.js:407:12
      @(shell):1:16
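
For scripted diagnosis, a command result can be checked against the fields shown in the error above. The helper below is a minimal sketch; `isConfigSetUnreachable` is a hypothetical name, and matching on the message prefix is an assumption based only on the sample output in this ticket.

```javascript
// Returns true when a command result document matches the failure shown
// in this ticket: ok 0, code 71, and the "None of the hosts" message.
// isConfigSetUnreachable is a hypothetical helper, not a shell built-in.
function isConfigSetUnreachable(result) {
    if (!result || result.ok !== 0) {
        return false;
    }
    return result.code === 71 &&
        typeof result.errmsg === "string" &&
        result.errmsg.indexOf("None of the hosts for replica set") === 0;
}
```

In the mongo shell this could be applied to the document returned by `db.runCommand(...)`, which returns the error document instead of throwing.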
      

      REMEDIATION AND WORKAROUNDS
      To resolve this issue, restart the affected mongos or mongod.

      This issue has been fixed in MongoDB 3.4.0 and 3.2.10:

      • MongoDB 3.4.0 contains the fix described in this ticket.
      • MongoDB 3.2.10 contains the fix described by SERVER-25516.

      On versions prior to MongoDB 3.2.10, this issue can be avoided by executing the following command at runtime on all mongos and mongod nodes:

      db.adminCommand( {setParameter: 1, 'replMonitorMaxFailedChecks': 2147483647} )
      

      Please note that this parameter does not persist and must be set each time the node restarts.
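
Because the setting has to be applied to every mongos and mongod, and again after each restart, it may help to script it. The following is a sketch under stated assumptions: `buildWorkaroundCommand`, `applyWorkaround`, and the `connect` callback are hypothetical names, and the host strings in the usage note are placeholders.

```javascript
// Value used by the workaround above (largest 32-bit signed integer).
var MAX_FAILED_CHECKS = 2147483647;

// Build the setParameter command document from the workaround.
function buildWorkaroundCommand() {
    return { setParameter: 1, replMonitorMaxFailedChecks: MAX_FAILED_CHECKS };
}

// Apply the workaround to each host in turn.
// connect(host) is a caller-supplied (hypothetical) hook that must return an
// object with an adminCommand(cmd) method, such as a shell DB handle.
function applyWorkaround(hosts, connect) {
    var results = [];
    for (var i = 0; i < hosts.length; i++) {
        var host = hosts[i];
        try {
            var res = connect(host).adminCommand(buildWorkaroundCommand());
            results.push({ host: host, ok: res.ok === 1 });
        } catch (e) {
            // Record the failure and continue with the remaining hosts.
            results.push({ host: host, ok: false, error: String(e) });
        }
    }
    return results;
}
```

In the mongo shell, `connect` could be `function (h) { return new Mongo(h).getDB("admin"); }`, and `hosts` a list such as `["mongos1.example.net:27017", "shard1.example.net:27018"]` (placeholder names). Any entry with `ok: false` needs to be retried by hand.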

      Original description

      If mongos loses network contact with all nodes of the CSRS config server replica set (both primary and secondaries), the replica set monitor deems the set 'unusable' and stops monitoring it.

      From this point onward, all operations that need to access config server metadata fail with the following error:

      > db.foo.find().itcount();
      
      2016-03-16T16:59:20.941-0400 E QUERY    [thread1] Error: error: {
              "code" : 71,
              "ok" : 0,
              "errmsg" : "None of the hosts for replica set test-configRS could be contacted."
      } :
      _getErrorWithCode@src/mongo/shell/utils.js:25:13
      DBCommandCursor@src/mongo/shell/query.js:694:1
      DBQuery.prototype._exec@src/mongo/shell/query.js:118:28
      DBQuery.prototype.hasNext@src/mongo/shell/query.js:281:5
      DBQuery.prototype.itcount@src/mongo/shell/query.js:407:12
      @(shell):1:16
      

      This includes refreshing the list of shards, which must be read from the config server metadata. Because there is currently no procedure to restart or retry monitoring of the config server set, the only recourse is to restart mongos.

            Assignee:
            misha.tyulenev@mongodb.com Misha Tyulenev
            Reporter:
            kaloian.manassiev@mongodb.com Kaloian Manassiev
            Votes:
            5
            Watchers:
            32
