[GODRIVER-1516] Go driver does not appear to obey DNS changes Created: 03/Mar/20  Updated: 27/Oct/23  Resolved: 16/Mar/20

Status: Closed
Project: Go Driver
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ben Birt Assignee: Divjot Arora (Inactive)
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongo.yaml     Text File primaryIsMaster.txt     Text File secondaryIsMaster.txt    

 Description   

Apologies if this isn't the right place to file this!

 

I'm running my client code (which uses the Go MongoDB driver) on Kubernetes, along with 3 hosts running MongoDB (in a replica set). The client is configured to connect to "mongodb://mongo-0.mongo,mongo-1.mongo,mongo-2.mongo:27017". This works.
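For reference, here's roughly what our client setup looks like (a simplified sketch rather than our real code; the timeouts and error handling are just illustrative):

package main

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
	"go.mongodb.org/mongo-driver/mongo/readpref"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Seed list of the three replica set members running in the cluster.
	uri := "mongodb://mongo-0.mongo:27017,mongo-1.mongo:27017,mongo-2.mongo:27017"
	client, err := mongo.Connect(ctx, options.Client().ApplyURI(uri))
	if err != nil {
		panic(err)
	}
	defer client.Disconnect(context.Background())

	// Confirm the deployment is reachable before running queries.
	if err := client.Ping(ctx, readpref.Primary()); err != nil {
		panic(err)
	}
}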

However, I have just noticed that if all Mongo replicas restart (and are given new IP addresses by Kubernetes), the client code doesn't seem to pick up this change; trying to run Mongo queries fails with 'connection reset by peer' errors, which indicate that the client is still trying to communicate with the old IP addresses.

 

I'm not sure what I should be doing here to handle this. Have we misconfigured some monitor setup, or do we need to configure some DNS resolution refresh frequency option? Should we be using "mongodb+srv" connection strings? I'm a little lost.



 Comments   
Comment by Divjot Arora (Inactive) [ 16/Mar/20 ]

ben@dataform.co Glad to hear it worked! I'm going to close out this ticket as "Works as Designed" but feel free to leave another comment or open a new ticket if you have any other issues!

– Divjot

Comment by Ben Birt [ 16/Mar/20 ]

Amazing, thank you!

I've set that variable, and everything appears to be working nicely:

{
	"hosts" : [
		"mongo-0.mongo.app-staging.svc.cluster.local:27017",
		"mongo-1.mongo.app-staging.svc.cluster.local:27017",
		"mongo-2.mongo.app-staging.svc.cluster.local:27017"
	],
	"setName" : "rs0",
	"setVersion" : 8231617,
	"ismaster" : true,
	"secondary" : false,
	"primary" : "mongo-1.mongo.app-staging.svc.cluster.local:27017",
	"me" : "mongo-1.mongo.app-staging.svc.cluster.local:27017",
	"electionId" : ObjectId("7fffffff0000000000000021"),
	"lastWrite" : {
		"opTime" : {
			"ts" : Timestamp(1584348624, 2),
			"t" : NumberLong(33)
		},
		"lastWriteDate" : ISODate("2020-03-16T08:50:24Z"),
		"majorityOpTime" : {
			"ts" : Timestamp(1584348624, 2),
			"t" : NumberLong(33)
		},
		"majorityWriteDate" : ISODate("2020-03-16T08:50:24Z")
	},
	"maxBsonObjectSize" : 16777216,
	"maxMessageSizeBytes" : 48000000,
	"maxWriteBatchSize" : 100000,
	"localTime" : ISODate("2020-03-16T08:50:33.405Z"),
	"logicalSessionTimeoutMinutes" : 30,
	"connectionId" : 407,
	"minWireVersion" : 0,
	"maxWireVersion" : 8,
	"readOnly" : false,
	"ok" : 1,
	"$clusterTime" : {
		"clusterTime" : Timestamp(1584348624, 2),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	},
	"operationTime" : Timestamp(1584348624, 2)
}

Would you like me to close this issue?

Comment by Divjot Arora (Inactive) [ 13/Mar/20 ]

I think we have an understanding of the root cause and some ideas for fixing it. The issue is happening because the isMaster responses report hardcoded IP addresses in the hosts field, and that field is what the driver uses to track the topology. The sequence of events is as follows (this is a simplified view for a single-node replica set, but it generalizes to your three-node case as well):

  1. The driver initializes an internal representation of a node at address mongo-0.mongo
  2. The driver sends an isMaster to the node. During connection creation, the node's address is resolved via DNS
  3. The node reports that the hosts are [10.8.4.214:27017]
  4. The driver calculates a difference between the nodes it knows about and the hosts reported in the isMaster and re-configures its view. In this case, that means throwing out the server at mongo-0.mongo and constructing a new server at 10.8.4.214:27017
  5. When the pod is bounced and gets a new IP address, the driver continues to send isMaster commands to 10.8.4.214:27017 because it is no longer aware of mongo-0.mongo. That node is down so the requests end with connection errors and lead to the server selection error you're observing.

Hopefully that gives you a sense of what's going on under the hood. Given the attached isMaster responses, this is expected behavior.
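If it helps to see this from the application side, here's a rough sketch of how you could log the addresses the driver is tracking as its view of the topology changes. This assumes a driver version that exposes event.ServerMonitor, which may be newer than the one you're running, so treat it as illustrative rather than something to copy verbatim:

package main

import (
	"context"
	"log"

	"go.mongodb.org/mongo-driver/event"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	// Log every server address the driver knows about whenever its view of
	// the topology changes.
	monitor := &event.ServerMonitor{
		TopologyDescriptionChanged: func(e *event.TopologyDescriptionChangedEvent) {
			for _, s := range e.NewDescription.Servers {
				log.Printf("tracking server %s (kind: %v)", s.Addr, s.Kind)
			}
		},
	}

	uri := "mongodb://mongo-0.mongo:27017,mongo-1.mongo:27017,mongo-2.mongo:27017"
	opts := options.Client().ApplyURI(uri).SetServerMonitor(monitor)
	client, err := mongo.Connect(context.Background(), opts)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(context.Background())

	select {} // block so topology updates keep being logged
}

Running something like that against your cluster would show the moment the driver swaps the mongo-N.mongo hostnames for the raw IP addresses reported in the isMaster responses.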

On the https://github.com/cvallance/mongo-k8s-sidecar site, I noticed that it says:

 

In its default configuration the sidecar uses the pods' IPs for the MongoDB replica names. Here is a trimmed example...

The table of settings at https://github.com/cvallance/mongo-k8s-sidecar#settings also mentions the KUBERNETES_MONGO_SERVICE_NAME environment variable, and the page later says:

If you want to use the StatefulSets' stable network ID, you have to make sure that you have the KUBERNETES_MONGO_SERVICE_NAME environmental variable set. Then the MongoDB replica set node names could look like this...

My understanding is that your app server runs inside the k8s cluster, so you should be able to set the KUBERNETES_MONGO_SERVICE_NAME environment variable, which would cause the nodes to report hostnames in their isMaster responses and allow the driver to track hostnames instead of IP addresses. Is this correct? If you do make this change, you can verify it by checking the isMaster response from any of the nodes.
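For example, something like the following sketch would print the hosts field that one node reports (the URI is a placeholder for any single member, and I'm assuming the connect=direct URI option so the command is answered by that exact node):

package main

import (
	"context"
	"fmt"
	"log"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx := context.Background()

	// Connect directly to a single member rather than letting the driver
	// select a server, so the response comes from this exact node.
	uri := "mongodb://mongo-0.mongo:27017/?connect=direct"
	client, err := mongo.Connect(ctx, options.Client().ApplyURI(uri))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	// Run isMaster and print the hosts the node reports for the replica set.
	var result bson.M
	cmd := bson.D{{Key: "isMaster", Value: 1}}
	if err := client.Database("admin").RunCommand(ctx, cmd).Decode(&result); err != nil {
		log.Fatal(err)
	}
	fmt.Println("hosts:", result["hosts"])
	fmt.Println("me:   ", result["me"])
}

If the hosts field lists the stable mongo-N.mongo.<namespace>.svc.cluster.local names instead of pod IPs, the driver will keep resolving those names via DNS on every new connection.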

 

– Divjot

 

Comment by Ben Birt [ 13/Mar/20 ]

Sure, I've attached the various files; let me know if you need anything more. (Note that the StorageClass resource in the mongo YAML assumes use of GKE.)

Comment by Divjot Arora (Inactive) [ 13/Mar/20 ]

Apologies for the back and forth on this ticket, but I have to ask for some more information to figure out where the IP addresses are coming from. Can you provide the following:

  1. Responses for the isMaster command from the cluster primary and a secondary. You can get these by connecting to the primary/secondary separately and running db.adminCommand({isMaster: 1})
  2. Your k8s YAML files

Given all of this, I'm going to try to reproduce this locally and see if anything sticks out. 

Comment by Ben Birt [ 13/Mar/20 ]

Thanks for the update!

 

That's right, those hostnames come from Kubernetes. If it's any help, I followed this guide (roughly, with a couple of edits) to run Mongo on k8s: https://kubernetes.io/blog/2017/01/running-mongodb-on-kubernetes-with-statefulsets/. I can also share our k8s YAML with you if that would be useful.

Comment by Divjot Arora (Inactive) [ 12/Mar/20 ]

ben@dataform.co Thanks for the error output, it's definitely helpful. One thing that's confusing me is why there are hardcoded IP addresses in the error output. Using an Atlas cluster, I've verified that the driver does not resolve hostnames to IP addresses at any point. Every time we make a new connection, we pass the hostname from the original connection string (in your case, this would be something like "mongo-0.mongo") to net.Dialer.DialContext(), which presumably does the correct DNS lookup. I'm still investigating on our end and just wanted to give you an update of where I am.
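In case it's useful for debugging on your side, here's a rough sketch of a wrapper dialer that logs every address the driver hands to DialContext (the type and variable names are made up for illustration). That would show definitively whether hostnames or raw IPs are being dialed from your app:

package main

import (
	"context"
	"log"
	"net"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// loggingDialer wraps net.Dialer and logs each address the driver dials.
type loggingDialer struct {
	d *net.Dialer
}

func (l *loggingDialer) DialContext(ctx context.Context, network, address string) (net.Conn, error) {
	log.Printf("dialing %s %s", network, address)
	return l.d.DialContext(ctx, network, address)
}

func main() {
	uri := "mongodb://mongo-0.mongo:27017,mongo-1.mongo:27017,mongo-2.mongo:27017"
	opts := options.Client().
		ApplyURI(uri).
		SetDialer(&loggingDialer{d: &net.Dialer{Timeout: 10 * time.Second}})

	client, err := mongo.Connect(context.Background(), opts)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(context.Background())
}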

Can you give any insight on where the hostnames like "mongo-0.mongo" come from? Are they generated by Kubernetes?

Comment by Ben Birt [ 12/Mar/20 ]

Here's an example error; hopefully it helps!

server selection error: server selection timeout, current topology: { Type: ReplicaSetNoPrimary, Servers: [
{ Addr: 10.8.4.214:27017, Type: Unknown, State: Connected, Average RTT: 0, Last error: connection() : connection(10.8.4.214:27017[-16521]) incomplete read of message header: read tcp 10.8.3.222:59930->10.8.4.214:27017: read: connection reset by peer },
{ Addr: 10.8.3.221:27017, Type: Unknown, State: Connected, Average RTT: 0, Last error: connection() : connection(10.8.3.221:27017[-16520]) incomplete read of message header: read tcp 10.8.3.222:38548->10.8.3.221:27017: read: connection reset by peer },
{ Addr: 10.8.8.146:27017, Type: Unknown, State: Connected, Average RTT: 0, Last error: connection() : connection(10.8.8.146:27017[-16519]) incomplete read of message header: read tcp 10.8.3.222:49938->10.8.8.146:27017: read: connection reset by peer },
] }

Comment by Divjot Arora (Inactive) [ 11/Mar/20 ]

Hi ben@dataform.co,

Can you post the full output of the error you get when running queries?
