  Core Server / SERVER-58448

Improve HostUnreachable error response for intra-cluster operations

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Service Arch

      This ticket requests an improvement to the response mongos sends to clients when it cannot reach a shard needed for an operation (for example, when a mongod has reached its maxIncomingConnections limit).

      Specifically, can the failure response identify which server could not be reached, as well as which server failed to reach it? This would help users avoid diagnosing connectivity between clients and routers when the actual problem is between routers and shard members (or, in rare cases, between two mongods).
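
      To make the request concrete, here is a rough sketch of what a more informative response might carry. It is purely illustrative: the function name `enriched_response` and the wording of the combined `errmsg` are hypothetical and do not reflect any existing server behavior or a proposed wire format. The host names are the ones from the example in this ticket.

```python
def enriched_response(reporting_host, unreachable_host, base_errmsg):
    """Build a hypothetical HostUnreachable response naming both endpoints.

    reporting_host:   the server that failed to make the connection (assumed field)
    unreachable_host: the server that could not be reached (assumed field)
    base_errmsg:      the original low-level message, kept for continuity
    """
    return {
        "ok": 0.0,
        # Hypothetical wording: names both ends of the failed connection.
        "errmsg": f"{reporting_host} could not reach {unreachable_host}: {base_errmsg}",
        "code": 6,
        "codeName": "HostUnreachable",
    }


response = enriched_response(
    "nodachi:27017",    # the mongos in this ticket's example
    "localhost:27018",  # the shard01 member it could not reach
    "Connection closed by peer",
)
print(response["errmsg"])
```

      With a message of this shape, a user reading only the client-side error could tell that the failing hop is between the router and a shard member rather than between the client and the router.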

      The mongos logs for such a failure look like:

      {"t":{"$date":"2021-07-12T13:30:10.300-07:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn526","msg":"client metadata","attr":{"remote":"127.0.0.1:62159","client":"conn526","doc":{"driver":{"name":"PyMongo","version":"3.12.0b0"},"os":{"type":"Darwin","name":"Darwin","architecture":"x86_64","version":"10.16"},"platform":"CPython 3.7.2.final.0","mongos":{"host":"nodachi:27017","client":"127.0.0.1:62159","version":"4.4.7-rc1"}}}}
      {"t":{"$date":"2021-07-12T13:30:10.322-07:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"conn526","msg":"Host failed in replica set","attr":{"replicaSet":"shard01","host":"localhost:27018","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"localhost:27018","success":false,"errorMessage":"HostUnreachable: Connection closed by peer"}}}}
      {"t":{"$date":"2021-07-12T13:30:10.329-07:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"conn526","msg":"Host failed in replica set","attr":{"replicaSet":"shard01","host":"localhost:27018","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"localhost:27018","success":false,"errorMessage":"HostUnreachable: Connection closed by peer"}}}}
      {"t":{"$date":"2021-07-12T13:30:14.383-07:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"conn526","msg":"Host failed in replica set","attr":{"replicaSet":"shard01","host":"localhost:27018","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"localhost:27018","success":false,"errorMessage":"HostUnreachable: Connection closed by peer"}}}}
      {"t":{"$date":"2021-07-12T13:30:14.397-07:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"conn526","msg":"Host failed in replica set","attr":{"replicaSet":"shard01","host":"localhost:27018","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"localhost:27018","success":false,"errorMessage":"HostUnreachable: Connection closed by peer"}}}}
      {"t":{"$date":"2021-07-12T13:30:19.557-07:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"conn526","msg":"Host failed in replica set","attr":{"replicaSet":"shard01","host":"localhost:27018","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"localhost:27018","success":false,"errorMessage":"HostUnreachable: Connection closed by peer"}}}}
      {"t":{"$date":"2021-07-12T13:30:19.574-07:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"conn526","msg":"Host failed in replica set","attr":{"replicaSet":"shard01","host":"localhost:27018","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"localhost:27018","success":false,"errorMessage":"HostUnreachable: Connection closed by peer"}}}}
      {"t":{"$date":"2021-07-12T13:30:24.391-07:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"conn526","msg":"Host failed in replica set","attr":{"replicaSet":"shard01","host":"localhost:27018","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"localhost:27018","success":false,"errorMessage":"HostUnreachable: Connection closed by peer"}}}}
      {"t":{"$date":"2021-07-12T13:30:24.402-07:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"conn526","msg":"Host failed in replica set","attr":{"replicaSet":"shard01","host":"localhost:27018","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"localhost:27018","success":false,"errorMessage":"HostUnreachable: Connection closed by peer"}}}}
      {"t":{"$date":"2021-07-12T13:30:24.414-07:00"},"s":"I",  "c":"QUERY",    "id":4625501, "ctx":"conn526","msg":"Unable to establish remote cursors","attr":{"error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"nRemotes":3}}
      {"t":{"$date":"2021-07-12T13:30:24.422-07:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn526","msg":"Slow query","attr":{"type":"command","ns":"test.test","command":{"aggregate":"test","pipeline":[{"$merge":{"into":"test2"}}],"cursor":{},"lsid":{"id":{"$uuid":"8e633a66-15fe-4c5e-b2b2-34c148a76b41"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1626121804,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$db":"test","$readPreference":{"mode":"primary"}},"numYields":0,"ok":0,"errMsg":"Connection closed by peer","errName":"HostUnreachable","errCode":6,"reslen":241,"protocol":"op_msg","durationMillis":14113}}
      {"t":{"$date":"2021-07-12T13:31:42.821-07:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn526","msg":"Connection ended","attr":{"remote":"127.0.0.1:62159","connectionId":526,"connectionCount":5}}
      

      Although these logs are not concise, they do make it possible to determine that the mongos failed to reach localhost:27018.
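
      As a workaround until the response improves, the structured log lines can be summarized mechanically. The following is a minimal sketch, assuming one JSON document per log line as shown above; the helper name `unreachable_hosts` is hypothetical, and the sample line is copied from the log excerpt in this ticket.

```python
import json
from collections import Counter


def unreachable_hosts(log_lines):
    """Count "Host failed in replica set" events (log id 4712102) that carry
    a HostUnreachable error, keyed by (replica set, host)."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything that is not a structured log line
        if entry.get("id") != 4712102:
            continue
        attr = entry.get("attr", {})
        if attr.get("error", {}).get("codeName") == "HostUnreachable":
            counts[(attr.get("replicaSet"), attr.get("host"))] += 1
    return counts


# One "Host failed in replica set" line and one unrelated line from the excerpt above.
sample = [
    '{"t":{"$date":"2021-07-12T13:30:10.322-07:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"conn526","msg":"Host failed in replica set","attr":{"replicaSet":"shard01","host":"localhost:27018","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"localhost:27018","success":false,"errorMessage":"HostUnreachable: Connection closed by peer"}}}}',
    '{"t":{"$date":"2021-07-12T13:30:24.414-07:00"},"s":"I",  "c":"QUERY",    "id":4625501, "ctx":"conn526","msg":"Unable to establish remote cursors","attr":{"error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"nRemotes":3}}',
]

for (replica_set, host), n in unreachable_hosts(sample).items():
    print(f"{replica_set} {host}: {n} failure(s)")
```

      This only works for someone with access to the mongos logs, which is the gap this ticket describes: the client-side error carries none of this information.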

      However, the response to the client is harder to interpret:

      {"ok": 0.0, "errmsg": "Connection closed by peer", "code": 6, "codeName": "HostUnreachable", "operationTime": {"$timestamp": {"t": 1626121818, "i": 1}}, "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1626121838, "i": 1}}, "signature": {"hash": {"$binary": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=", "$type": "00"}, "keyId": 0}}}
      

      and the shell's response is:

      mongos> db.test.aggregate([{$merge:{into:"test2"}}])
      uncaught exception: Error: command failed: {
      	"ok" : 0,
      	"errmsg" : "Connection closed by peer",
      	"code" : 6,
      	"codeName" : "HostUnreachable",
      	"operationTime" : Timestamp(1626114452, 1),
      	"$clusterTime" : {
      		"clusterTime" : Timestamp(1626114488, 1),
      		"signature" : {
      			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
      			"keyId" : NumberLong(0)
      		}
      	}
      } : aggregate failed :
      _getErrorWithCode@src/mongo/shell/utils.js:25:13
      doassert@src/mongo/shell/assert.js:18:14
      _assertCommandWorked@src/mongo/shell/assert.js:665:17
      assert.commandWorked@src/mongo/shell/assert.js:755:16
      DB.prototype._runAggregate@src/mongo/shell/db.js:266:5
      DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1058:12
      @(shell):1:1
      

            Assignee:
            backlog-server-servicearch [DO NOT USE] Backlog - Service Architecture
            Reporter:
            eric.sedor@mongodb.com Eric Sedor
            Votes:
            0
            Watchers:
            3
