[SERVER-58448] Improve HostUnreachable error response for intra-cluster operations Created: 12/Jul/21  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Eric Sedor Assignee: Backlog - Service Architecture
Resolution: Unresolved Votes: 0
Labels: sa-remove-fv-backlog-22
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-60063 Log server discovery times Closed
Assigned Teams:
Service Arch
Participants:

 Description   

This ticket requests improvement in mongos's response to clients when mongos cannot reach a necessary shard for an operation (such as if mongod maxIncomingConnections is reached).

Specifically, is it possible to identify in the failure response which server could not be reached, as well as which server failed to reach it? This would help ensure users do not try to diagnose connectivity issues between clients and routers when the issue is between routers and shard members (or, in rare cases, between two mongods).

The mongos logs for such a failure look like:

{"t":{"$date":"2021-07-12T13:30:10.300-07:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn526","msg":"client metadata","attr":{"remote":"127.0.0.1:62159","client":"conn526","doc":{"driver":{"name":"PyMongo","version":"3.12.0b0"},"os":{"type":"Darwin","name":"Darwin","architecture":"x86_64","version":"10.16"},"platform":"CPython 3.7.2.final.0","mongos":{"host":"nodachi:27017","client":"127.0.0.1:62159","version":"4.4.7-rc1"}}}}
{"t":{"$date":"2021-07-12T13:30:10.322-07:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"conn526","msg":"Host failed in replica set","attr":{"replicaSet":"shard01","host":"localhost:27018","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"localhost:27018","success":false,"errorMessage":"HostUnreachable: Connection closed by peer"}}}}
{"t":{"$date":"2021-07-12T13:30:10.329-07:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"conn526","msg":"Host failed in replica set","attr":{"replicaSet":"shard01","host":"localhost:27018","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"localhost:27018","success":false,"errorMessage":"HostUnreachable: Connection closed by peer"}}}}
{"t":{"$date":"2021-07-12T13:30:14.383-07:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"conn526","msg":"Host failed in replica set","attr":{"replicaSet":"shard01","host":"localhost:27018","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"localhost:27018","success":false,"errorMessage":"HostUnreachable: Connection closed by peer"}}}}
{"t":{"$date":"2021-07-12T13:30:14.397-07:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"conn526","msg":"Host failed in replica set","attr":{"replicaSet":"shard01","host":"localhost:27018","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"localhost:27018","success":false,"errorMessage":"HostUnreachable: Connection closed by peer"}}}}
{"t":{"$date":"2021-07-12T13:30:19.557-07:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"conn526","msg":"Host failed in replica set","attr":{"replicaSet":"shard01","host":"localhost:27018","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"localhost:27018","success":false,"errorMessage":"HostUnreachable: Connection closed by peer"}}}}
{"t":{"$date":"2021-07-12T13:30:19.574-07:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"conn526","msg":"Host failed in replica set","attr":{"replicaSet":"shard01","host":"localhost:27018","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"localhost:27018","success":false,"errorMessage":"HostUnreachable: Connection closed by peer"}}}}
{"t":{"$date":"2021-07-12T13:30:24.391-07:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"conn526","msg":"Host failed in replica set","attr":{"replicaSet":"shard01","host":"localhost:27018","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"localhost:27018","success":false,"errorMessage":"HostUnreachable: Connection closed by peer"}}}}
{"t":{"$date":"2021-07-12T13:30:24.402-07:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"conn526","msg":"Host failed in replica set","attr":{"replicaSet":"shard01","host":"localhost:27018","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"action":{"dropConnections":true,"requestImmediateCheck":false,"outcome":{"host":"localhost:27018","success":false,"errorMessage":"HostUnreachable: Connection closed by peer"}}}}
{"t":{"$date":"2021-07-12T13:30:24.414-07:00"},"s":"I",  "c":"QUERY",    "id":4625501, "ctx":"conn526","msg":"Unable to establish remote cursors","attr":{"error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"},"nRemotes":3}}
{"t":{"$date":"2021-07-12T13:30:24.422-07:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn526","msg":"Slow query","attr":{"type":"command","ns":"test.test","command":{"aggregate":"test","pipeline":[{"$merge":{"into":"test2"}}],"cursor":{},"lsid":{"id":{"$uuid":"8e633a66-15fe-4c5e-b2b2-34c148a76b41"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1626121804,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$db":"test","$readPreference":{"mode":"primary"}},"numYields":0,"ok":0,"errMsg":"Connection closed by peer","errName":"HostUnreachable","errCode":6,"reslen":241,"protocol":"op_msg","durationMillis":14113}}
{"t":{"$date":"2021-07-12T13:31:42.821-07:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn526","msg":"Connection ended","attr":{"remote":"127.0.0.1:62159","connectionId":526,"connectionCount":5}}

While not very concise, it is possible to understand from the mongos logs that the mongos failed to reach localhost:27018.
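
As an illustration of how that information can be pulled out of the log today, here is a minimal sketch that tallies "Host failed in replica set" entries per replica set and host. It assumes one JSON document per log line, as in the excerpt above, and relies only on the fields visible there (`id` 4712102, `attr.replicaSet`, `attr.host`, `attr.error.codeName`). The function name is hypothetical and not part of any MongoDB tool.

```python
import json
from collections import Counter

# Log id of the "Host failed in replica set" entries in the excerpt above.
HOST_FAILED_LOG_ID = 4712102


def count_unreachable_hosts(log_lines):
    """Count HostUnreachable failures per (replicaSet, host).

    `log_lines` is any iterable of strings, one structured log entry each.
    Lines that are blank or not JSON objects are skipped.
    """
    failures = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line
        if not isinstance(entry, dict) or entry.get("id") != HOST_FAILED_LOG_ID:
            continue
        attr = entry.get("attr", {})
        if attr.get("error", {}).get("codeName") != "HostUnreachable":
            continue
        failures[(attr.get("replicaSet"), attr.get("host"))] += 1
    return failures
```

Passing an open log file to `count_unreachable_hosts` (for example `open("mongos.log")`, where the path is an assumption) would return a counter keyed by pairs such as `("shard01", "localhost:27018")`.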

However, the response to the client is less informative:

{"ok": 0.0, "errmsg": "Connection closed by peer", "code": 6, "codeName": "HostUnreachable", "operationTime": {"$timestamp": {"t": 1626121818, "i": 1}}, "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1626121838, "i": 1}}, "signature": {"hash": {"$binary": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=", "$type": "00"}, "keyId": 0}}}

and the shell's response is:

mongos> db.test.aggregate([{$merge:{into:"test2"}}])
uncaught exception: Error: command failed: {
	"ok" : 0,
	"errmsg" : "Connection closed by peer",
	"code" : 6,
	"codeName" : "HostUnreachable",
	"operationTime" : Timestamp(1626114452, 1),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1626114488, 1),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	}
} : aggregate failed :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:18:14
_assertCommandWorked@src/mongo/shell/assert.js:665:17
assert.commandWorked@src/mongo/shell/assert.js:755:16
DB.prototype._runAggregate@src/mongo/shell/db.js:266:5
DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1058:12
@(shell):1:1
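
Until the response itself says more, a client or dashboard can only infer where the failure happened. The following is a rough sketch of that inference, assuming the reply has already been decoded into a dictionary shaped like the response above. The function name and the wording of the hint are hypothetical. It uses only the `ok`, `code`, and `errmsg` fields shown in this ticket.

```python
# Error code carried by the HostUnreachable response shown above.
HOST_UNREACHABLE = 6


def explain_command_failure(reply):
    """Return a human-readable hint for a failed command reply, or None on success."""
    if reply.get("ok"):
        return None
    if reply.get("code") == HOST_UNREACHABLE:
        # The reply arrived over the client's own connection, so the
        # unreachable host is probably not the server the client contacted.
        return (
            "HostUnreachable was returned by the server: the failure is "
            "probably inside the cluster. Check the logs of the router "
            "that handled the command."
        )
    return "Command failed: {}".format(reply.get("errmsg", "unknown error"))
```

This is only a heuristic. The reply does not name the unreachable host or the router that failed to reach it, which is the gap this ticket describes.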



 Comments   
Comment by Eric Sedor [ 14/Sep/21 ]

Thank you Ratika! matthew.saltz, are you proposing making it clearer that the error is within the cluster by saying something generic like "Router could not reach a shard member necessary for this operation"? If so, yes, that satisfies this request within the bounds of the security concern you raise.

It makes total sense to me that including topological details could run afoul of the access privileges granted to the database user executing an operation (and that checking privileges for an error response is not worth the effort).

As well, yes to your second question. Just to elaborate by example: It seems entirely likely to me that a user could be operating off of a dashboard that bubbles these responses up to them and with nothing else to go on would either A) not know where to start or B) treat it as a connectivity issue from the app to the mongos.

But I'd raise this question: Would you consider it possible to include enough information in the response to inform the client/user which mongos's logs they could check to trace the failure further?

Comment by Ratika Gandhi [ 14/Sep/21 ]

Tagging eric.sedor for a response. Thanks

Comment by Matthew Saltz (Inactive) [ 31/Aug/21 ]

We're not sure whether, for security reasons, we are allowed to expose the internal cluster topology to the client by reporting the host and port of the unreachable host. However, we could at least report whether the error occurred within the cluster or between the client and the cluster (even though the client should be able to tell locally that it was not disconnected from the node it is contacting). Would that be helpful at all?

Alternatively, what problem exactly would this help with? Is this for users who run into this error message and are confused about what to do? eric.sedor
