[GODRIVER-2525] Occasional handshake error when using mongodb+srv with mongos pool Created: 16/Aug/22  Updated: 27/Oct/23  Resolved: 23/Dec/22

Status: Closed
Project: Go Driver
Component/s: Connections, Error Handling
Affects Version/s: 1.9.1
Fix Version/s: None

Type: Bug Priority: Unknown
Reporter: Peter Ivanov Assignee: Benji Rewis (Inactive)
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


Description

Summary

About once a day, we see an error like this:

    connection() error occurred during connection handshake: dial tcp: lookup foo-bar-mongos.svc.cluster.local on 169.254.25.10:53: no such host

We are using the 1.9.1 mongo driver with the following setup (a minimal client sketch follows the list):

  • sharded cluster
  • mongos instances are run as an auto-scaled pool
  • access to mongos is via SRV record
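
For reference, a minimal sketch of how we create the client (the URI below is a placeholder for our real SRV hostname, and tls=false is an assumption about an in-cluster setup):

    package main

    import (
        "context"
        "time"

        "go.mongodb.org/mongo-driver/mongo"
        "go.mongodb.org/mongo-driver/mongo/options"
    )

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        // Hosts are discovered via the SRV record behind this hostname (placeholder).
        uri := "mongodb+srv://foo-bar-mongos.svc.cluster.local/?tls=false"
        client, err := mongo.Connect(ctx, options.Client().ApplyURI(uri))
        if err != nil {
            panic(err)
        }
        defer client.Disconnect(ctx)

        // The handshake error in question surfaces from ordinary operations,
        // including a simple Ping.
        if err := client.Ping(ctx, nil); err != nil {
            panic(err)
        }
    }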

Because these errors are relatively rare, we assume they occur when one of the mongos instances is either starting up or shutting down.

Our guess is that the issue boils down to a race between SRV and A records, possibly coupled with DNS caching and the like, and this seems like the kind of issue that is better handled inside the driver itself.
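
To illustrate the suspected race: a "mongodb+srv://" connection involves two DNS steps, an SRV query followed by an A/AAAA lookup of each returned target, and the second step fails with "no such host" if a pod's A record disappears while its SRV entry is still cached somewhere. A rough sketch of the two steps (the hostname is a placeholder):

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        // Step 1: the SRV query, as performed for a "mongodb+srv://" URI.
        // The driver looks up _mongodb._tcp.<hostname>.
        _, srvs, err := net.LookupSRV("mongodb", "tcp", "foo-bar-mongos.svc.cluster.local")
        if err != nil {
            panic(err)
        }

        // Step 2: resolve each SRV target before dialing. If a mongos pod
        // was just removed, this is where "no such host" can appear even
        // though step 1 still listed the target.
        for _, srv := range srvs {
            addrs, err := net.LookupHost(srv.Target)
            if err != nil {
                fmt.Printf("lookup %s failed: %v\n", srv.Target, err)
                continue
            }
            fmt.Printf("%s -> %v (port %d)\n", srv.Target, addrs, srv.Port)
        }
    }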

At this time we cannot propose a trivial WTR (way to reproduce) for this issue. If we can help diagnose it, for example by enabling verbose logs and sending them to you, feel free to give us instructions.



Comments
Comment by PM Bot [ 23/Dec/22 ]

There hasn't been any recent activity on this ticket, so we're resolving it. Thanks for reaching out! Please feel free to comment on this if you're able to provide more information.

Comment by Benji Rewis (Inactive) [ 08/Dec/22 ]

The Go driver team does not feel that a configurable rescanSRVInterval is a great fix for this situation. While we believe that "knob" does reduce the number of SRV lookup errors, we also think that GODRIVER-2579 will almost entirely remove the possibility of errors like the ones you're seeing being raised to users. We'd rather not expose new API to help users avoid odd driver behavior, because once we've fixed that behavior the API would likely become irrelevant yet permanent (removing it post hoc would be backward-breaking). If you're intent on using a reduced SRV rescan interval, we may ask you to rely on your fork of v1.9.4 for the time being.
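
If you do stay on the fork for now, the usual way to pin it is a replace directive in go.mod (the fork path and version below are hypothetical):

    // In go.mod: point the upstream module at your fork.
    // The fork path and the version tag are hypothetical examples.
    replace go.mongodb.org/mongo-driver => github.com/your-org/mongo-go-driver v1.9.4-patched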

Comment by Artem Navrotskiy [ 07/Dec/22 ]

"it may be difficult for most users to reason about which value to use for rescanSRVIntervalMS"

I don't see this as a problem. If you don't explicitly need to change this parameter, just leave the default value. As things stand, you have to dig into the code even to see what the interval is.

"Are you still seeing this issue and is that open PR from your team?"

After reducing the interval from 60 seconds to 30, the error still occurred, but roughly five times less often (I don't remember the exact numbers).

"Have you updated your version of the Go driver beyond 1.9.1?"

We now use version 1.9.4 with the changes from the PR.

Comment by Benji Rewis (Inactive) [ 05/Dec/22 ]

Hello again, petr.ivanov.s@gmail.com. I'm following up on this ticket, as there seems to be an open PR related to this issue. Is the author someone from your team?

While making the SRV rescan interval configurable may feasibly solve this issue for you all, we're hesitant to introduce a new "knob" to the driver: adding a URI option/client option for rescanSRVIntervalMS would be a cross-driver change, and it may be difficult for most users to reason about which value to use for rescanSRVIntervalMS. We have an upcoming change, GODRIVER-2579/GODRIVER-2191 (retrying operations if the connection handshake fails), that would probably stop these SRV lookup errors from bubbling up to your application. We would simply retry the handshake, and the retry would probably succeed given the sequence of events matt.dale@mongodb.com describes in his comment.
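
Until that change lands, a retry at the application level would mask most occurrences. A rough sketch (matching on the error string is a heuristic of mine, not an official driver error classification):

    package handshakeretry

    import (
        "context"
        "strings"
    )

    // WithHandshakeRetry runs op and retries it once if it fails with the
    // DNS-flavored handshake error from this ticket. The string match is an
    // assumption, not driver API.
    func WithHandshakeRetry(ctx context.Context, op func(context.Context) error) error {
        err := op(ctx)
        if err != nil && strings.Contains(err.Error(), "no such host") {
            // By the time the retry runs, the failing host has been marked
            // Unknown, so server selection should pick a different mongos.
            err = op(ctx)
        }
        return err
    }

You would wrap individual operations, e.g. WithHandshakeRetry(ctx, func(c context.Context) error { return coll.FindOne(c, filter).Err() }).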

I have three questions:

  1. Are you still seeing this issue and is that open PR from your team?
  2. Have you updated your version of the Go driver beyond 1.9.1?
  3. How do you feel about waiting for GODRIVER-2579/2191 (currently planned for this quarter) to resolve this issue?
Comment by PM Bot [ 16/Nov/22 ]

There hasn't been any recent activity on this ticket, so we're resolving it. Thanks for reaching out! Please feel free to comment on this if you're able to provide more information.

Comment by Matt Dale [ 01/Nov/22 ]

petr.ivanov.s@gmail.com we recently discovered a bug in the SRV polling behavior of the Go Driver that may prevent changes in SRV records from updating the servers that the Go Driver attempts to connect to when the MongoDB connection string includes a username and password (see GODRIVER-2620 for more details). We've fixed the bug and are planning to release the fix with Go Driver versions 1.8.6, 1.9.3, 1.10.4, and 1.11.0 tomorrow.

Do you use a username and password in your MongoDB connection string? If so, please update to one of the fix versions listed above as soon as they are available and see whether that prevents or reduces the handshake errors. Since you're using 1.9.1, I recommend updating to 1.9.3, as it will be the least risky change.
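
To confirm which driver version a build actually contains, you can print the version string the driver itself reports during the connection handshake (a quick sketch):

    package main

    import (
        "fmt"

        "go.mongodb.org/mongo-driver/version"
    )

    func main() {
        // Prints the compiled-in driver version string, e.g. "v1.9.3".
        fmt.Println(version.Driver)
    }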

As far as server behavior goes, MongoDB 5.0 added a "quiesce" mode that's used during shutdown to let connected drivers gracefully remove servers that are shutting down (see the MongoDB documentation on quiesce mode). If updating to a patched Go Driver version doesn't help, updating to MongoDB 5.0 may.

Comment by Peter Ivanov [ 25/Oct/22 ]

Question 1: no, we use a pretty much bare-bones cluster on AWS EC2 instances. The mongos instances run in Kubernetes and scale according to load.

For question 2, I'll ask a colleague to answer, but it's worth noting that we are on MongoDB 4.4, and shutdown handling may have improved since then. The issue may not be with graceful shutdown alone, though.

Comment by Matt Dale [ 13/Oct/22 ]

Hey petr.ivanov.s@gmail.com, sorry about the slow reply. I've been attempting to reproduce the error you described but have so far been unsuccessful. However, I have a possible sequence of events that could lead to the error (a monitoring sketch for observing these transitions follows the list):

  1. Initialize a mongo.Client with a "mongodb+srv://" scheme URI that specifies hosts [mongos1, mongos2, mongos3]. The mongo.Client creates monitoring connections to hosts [mongos1, mongos2, mongos3] and determines that they are all valid mongos instances.
  2. Kubernetes removes pod mongos3, removes the associated DNS record mongos3.svc.cluster.local, and removes mongos3.svc.cluster.local from the associated SRV record.
  3. Run an operation using the mongo.Client, which selects host mongos3 for the operation. The mongo.Client still considers mongos3.svc.cluster.local valid because it hasn't received any signals from that host.
  4. The mongo.Client attempts to create a new connection to mongos3.svc.cluster.local and encounters an error like

    dial tcp: lookup mongos3.svc.cluster.local on 169.254.25.10:53: no such host
    

  5. The mongo.Client marks mongos3.svc.cluster.local as "Unknown" and prevents it from being selected for subsequent operations.
  6. The mongo.Client polls for the SRV record from the original "mongodb+srv://" URI, sees that mongos3.svc.cluster.local is removed, and removes it from its list of servers.
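
If you want to watch steps 5 and 6 happen from the application side, the driver's SDAM monitoring hooks can log the transitions. A minimal sketch (the URI is a placeholder):

    package main

    import (
        "context"
        "log"
        "time"

        "go.mongodb.org/mongo-driver/event"
        "go.mongodb.org/mongo-driver/mongo"
        "go.mongodb.org/mongo-driver/mongo/options"
    )

    func main() {
        monitor := &event.ServerMonitor{
            // Fires whenever a server's description changes, including the
            // transition to "Unknown" after a failed handshake (step 5).
            ServerDescriptionChanged: func(e *event.ServerDescriptionChangedEvent) {
                log.Printf("server %s: %s -> %s",
                    e.Address, e.PreviousDescription.Kind, e.NewDescription.Kind)
            },
            // Fires when a server is removed from the topology, e.g. after
            // an SRV poll no longer returns it (step 6).
            ServerClosed: func(e *event.ServerClosedEvent) {
                log.Printf("server %s removed from topology", e.Address)
            },
        }

        opts := options.Client().
            ApplyURI("mongodb+srv://foo-bar-mongos.svc.cluster.local").
            SetServerMonitor(monitor)
        client, err := mongo.Connect(context.Background(), opts)
        if err != nil {
            log.Fatal(err)
        }
        defer client.Disconnect(context.Background())

        // Keep the client alive long enough to observe events.
        time.Sleep(10 * time.Minute)
    }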


Based on that, I have a few more questions:

  1. Are you using either the MongoDB Enterprise or Community Kubernetes Operator to manage your MongoDB cluster?
  2. How is Kubernetes shutting down the mongos process in the pods?
    Typically, if mongos is shut down gracefully (e.g. via SIGTERM or SIGINT, or by running db.shutdownServer()), it signals to connected drivers that it is shutting down before it becomes unavailable. However, it sounds like that is not happening here, which may indicate that mongos is not shutting down gracefully.
Comment by Peter Ivanov [ 29/Aug/22 ]
  • Yes, both backend service and mongos run in Kubernetes
  • Routing is done via a headless service
  • Such errors are not very numerous. If by 'many' you mean thousands, then no; we rarely see more than a dozen a minute across all our services
  • We use mongod 4.4.10, and most of the mongos pool is on 4.4.6
Comment by Matt Dale [ 26/Aug/22 ]

Hey petr.ivanov.s@gmail.com, thanks for the ticket; we're looking into it!

I've got a few questions to help me troubleshoot the issue:

  • Based on the provided hostname, your mongos pool appears to be running in Kubernetes. Is that correct?
  • When you see those errors, do you typically see a single error or many errors within a small window?
  • What version of MongoDB are you connecting to?