[GODRIVER-1297] Opening new connections takes a long time Created: 04/Sep/19 Updated: 23/Oct/19 Resolved: 23/Oct/19 |
|
| Status: | Closed |
| Project: | Go Driver |
| Component/s: | Connections |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bradley Wilson-Hunt | Assignee: | Divjot Arora (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Alpine Docker, x3 pods; cluster is hosted on MongoDB Atlas in the same region and with the same cloud provider as the Kubernetes cluster |
| Attachments: | |
| Issue Links: | |
| Case: | (copied to CRM) |
| Description |
|
We have been migrating our applications from the community mgo driver to the official MongoDB Go driver. Some of our higher-traffic applications (100-200 rps) have started to experience sporadic issues in production. We are also able to reproduce the same problem under a load test in our staging environment. After much debugging, it appears that when the driver doesn't have a connection available to the database, it attempts to open a new one. However, we are seeing that operation take upwards of 3 seconds in some cases (we have pprof outputs showing 1.6+ seconds). The issue is isolated to individual pods, with no correlation to the connecting AZ, etc.
To counter the issue we tried to specify the minPoolSize option that was added in v1.1.0, but if this option is specified the driver doesn't actually start (LINK ISSUE HERE). We also tried building from master (22646d953d8106e567b1da9aab98b627a2fb204f); the driver is able to connect to MongoDB but then panics (LINK ISSUE HERE).
Here you can see the case I'm trying to describe. All of the time seems to be spent in sha1.blockAMD64:
A pprof taken from one of the other pods that didn't have any issues at the same time:
It doesn't seem like I can attach the pprof profiles here, but if you need access to them I can email them or send them over direct message.
We are connecting to mongo in the following way: https://gist.github.com/BradErz/a947198bddf537532190fdb5ea015af3#file-mongoconnection-go
This is the code handling the query: https://gist.github.com/BradErz/a947198bddf537532190fdb5ea015af3#file-persistence-go
If we can help by providing any more information/debugging please let me know.
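(For reference, a minimal sketch, not the reporter's code, of how a CPU profile like the ones mentioned above can be captured with Go's built-in profiler; the port is illustrative.)

```go
// Sketch: expose Go's runtime profiler over HTTP so a CPU profile can be
// captured while the slow connection creation is happening.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the profiler on an internal-only port.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

A 30-second CPU profile can then be pulled with `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`.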
|
| Comments |
| Comment by Divjot Arora (Inactive) [ 23/Oct/19 ] |
|
Our recommendation is that you use the min pool size option because that will allow connections to be authenticated in the background. This should resolve the issue of new connection creation taking a long time. For now, I'm going to close this ticket. Feel free to re-open or open another ticket if you run into any other issues. – Divjot
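A minimal sketch of setting the min pool size through the client options, assuming a driver version in which the minPoolSize bugs referenced in this ticket have been fixed; the URI and pool sizes are placeholders:

```go
// Sketch: ask the driver to keep a minimum number of connections per server,
// so they are created and authenticated in the background rather than on the
// request path.
package main

import (
	"context"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	opts := options.Client().
		ApplyURI("mongodb+srv://user:pass@example.mongodb.net"). // placeholder URI
		SetMinPoolSize(10).                                      // placeholder minimum
		SetMaxPoolSize(100)                                      // placeholder maximum

	client, err := mongo.Connect(ctx, opts)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(context.Background())
}
```

The same option can also be passed in the connection string as `minPoolSize=10`.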
| Comment by Bradley Wilson-Hunt [ 16/Sep/19 ] |
|
Sorry for taking so long to get back to this..
I pushed some code that is able to replicate the issue: https://github.com/BradErz/test-mongo
If something else is needed please let me know.
Kind Regards, Bradley
| Comment by Bradley Wilson-Hunt [ 05/Sep/19 ] |
|
We are calling `GetDatabase` once in the initialization and then adding that to a struct in the persistence layer.
The `mongoclient` pkg you see there is just the code I pasted before in the example from GetDatabase. We just wrap the library for convenience. All of our applications use only one database, and from that mongoclient.Database they use different collections. Hope that makes sense.
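An illustrative sketch of the pattern described in this comment, where the database handle is created once at startup and stored on a struct in the persistence layer; names such as `Store` and the `users` collection are placeholders, not the actual application code:

```go
// Sketch: the persistence layer receives an already-initialized *mongo.Database
// and keeps collection handles on a struct, so no connection setup happens per
// request.
package persistence

import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

type Store struct {
	users *mongo.Collection
}

// NewStore takes the database handle produced once at startup (e.g. by a
// GetDatabase-style helper) and stores the collections it needs.
func NewStore(db *mongo.Database) *Store {
	return &Store{users: db.Collection("users")}
}

// FindUser runs a query through the shared connection pool.
func (s *Store) FindUser(ctx context.Context, id string) (bson.M, error) {
	var out bson.M
	err := s.users.FindOne(ctx, bson.M{"_id": id}).Decode(&out)
	return out, err
}
```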
| Comment by Bradley Wilson-Hunt [ 05/Sep/19 ] |
|
Also, it's worth noting that some of our other applications are already using this driver in production and suffer from sporadic issues where the connection pool is starved, the driver then needs to open a new connection, and that pod becomes useless for a while.
As you can see here, all pods were receiving the same RPS, but all of a sudden one of the pods went crazy on CPU, latency, and errors.
When collecting the pprof we see it is always blocking at that point. The error won't go away on its own; the pod needs to be killed.
| Comment by David Golden [ 05/Sep/19 ] |
|
I'll let the driver team respond further about the driver pool details. I do see in your example code that you have GetDatabase calling GetClient. If you aren't persisting the database objects or if you are creating lots of databases, then each of those is going to have a Client that needs to connect the first time it's used. Assuming that you have the same credentials and client options across requests, I would generally expect an app to have a single Client object persisted somewhere and construct Database objects from it as needed, so that all your Database objects are reusing the same connection pool. If you do that, you can initialize your Client at start up and warm up a pool (until the min pool size option is fixed) so that you aren't trying to make connections while servicing requests.
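A minimal sketch of the pattern described here, with a single process-wide Client, Database handles derived from it, and a Ping at startup as a stop-gap warm-up until minPoolSize works; the URI and names are placeholders:

```go
// Sketch: one Client for the whole process, Database handles derived from it,
// and the pool warmed up before the service starts taking traffic.
package main

import (
	"context"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
	"go.mongodb.org/mongo-driver/mongo/readpref"
)

var client *mongo.Client // single shared client for the process

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	var err error
	client, err = mongo.Connect(ctx, options.Client().
		ApplyURI("mongodb+srv://user:pass@example.mongodb.net"). // placeholder URI
		SetMaxPoolSize(100))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(context.Background())

	// Warm-up: force at least one authenticated connection at startup instead
	// of paying the SCRAM cost on the first request.
	if err := client.Ping(ctx, readpref.Primary()); err != nil {
		log.Fatal(err)
	}

	// Every request path reuses the same underlying connection pool.
	db := client.Database("app")
	_ = db
}
```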
| Comment by Bradley Wilson-Hunt [ 05/Sep/19 ] |
|
Hi David! Thanks for replying, that does make sense. Are connections only opened within the driver when needed or does it always keep a buffer? And how does that compare to the behavior from mgo when dealing with the connection?
Just as an example, this is what happens when we roll out the official mongo driver in place of the mgo one. The issue doesn't seem to fix itself either; I guess that because the context is cancelled due to the timeout from our API gateway, the driver never successfully opens a new connection to the database, so we are always at the minimum, which from what I understand of the driver code is 0 (maybe 1 or 2 connections are opened, but that's obviously not enough to deal with that load). We didn't experience any behaviour like this with the mgo driver. Do you have any recommendations going forward?
Kind Regards, Bradley
| Comment by David Golden [ 04/Sep/19 ] |
|
Hi. I'm xdg on Github. Just for context, the PBKDF2 iterations are an intentional work factor to increase computational costs, just like with bcrypt, to make brute force attacks harder. So seeing delays there is by design and is also affected by the scramSHA256IterationCount setting (or scramIterationCount, if you're using SCRAM-SHA-1).
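For reference, a hedged sketch of how those iteration counts could be inspected from Go via the server's getParameter command; the parameter names are the documented server parameters, the URI is a placeholder, and Atlas may restrict this command on some tiers:

```go
// Sketch: read the server's SCRAM iteration counts, which control how
// expensive each fresh authentication (and therefore each new connection) is.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017")) // placeholder URI
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(context.Background())

	var result bson.M
	err = client.Database("admin").RunCommand(ctx, bson.D{
		{Key: "getParameter", Value: 1},
		{Key: "scramIterationCount", Value: 1},
		{Key: "scramSHA256IterationCount", Value: 1},
	}).Decode(&result)
	if err != nil {
		log.Fatal(err) // e.g. the command may not be permitted on managed deployments
	}
	fmt.Println(result)
}
```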
| Comment by Bradley Wilson-Hunt [ 04/Sep/19 ] |
|
It seems the underlying library this is coming from is https://github.com/xdg-go/scram/blob/d4e428aa08f9fb604bbbe12dc0f1c842f9d7443e/client.go#L108
| Comment by Bradley Wilson-Hunt [ 04/Sep/19 ] |
|
I can't seem to edit the post, so the (LINK ISSUE HERE) links will go here instead.
Panic in current master (22646d953d8106e567b1da9aab98b627a2fb204f): https://jira.mongodb.org/browse/GODRIVER-1298
v1.1.0 doesn't start if minPoolSize is specified: