[GODRIVER-1297] Opening new connections takes a long time Created: 04/Sep/19  Updated: 23/Oct/19  Resolved: 23/Oct/19

Status: Closed
Project: Go Driver
Component/s: Connections
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bradley Wilson-Hunt Assignee: Divjot Arora (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Alpine Docker
CPU Request: 100m
CPU Limit: 100m
Memory Request: 50Mi
Memory Limit: 100Mi

x3 Pods
MaxPoolSize 100

Cluster is hosted on mongo atlas in the same region + same cloud provider as the kubernetes cluster


Attachments: PNG File image-2019-09-05-15-57-29-894.png     PNG File image-2019-09-05-16-00-03-997.png    
Issue Links:
Depends
depends on GODRIVER-1298 Panic in topology/pool.go:416 Closed
Case:

 Description   

We have been migrating our applications from the community mgo driver to the official mongo driver.

Some of our higher-traffic applications (100-200 rps) have started to experience sporadic issues in production. We are also able to replicate the same problem under a load test running in our staging environment.

After much debugging, it seems the driver doesn't have a connection available to the database, so it attempts to open a new one. However, we are seeing that operation take upwards of 3 seconds in some cases (we have pprof outputs showing 1.6+ seconds). The issue is isolated to individual pods, with no correlation to the connecting AZ, etc.

To counter the issue we tried to specify minPoolSize, which was added in v1.1.0; however, if this option is specified the driver doesn't actually start (LINK ISSUE HERE). We also tried building from master (22646d953d8106e567b1da9aab98b627a2fb204f); the driver is able to connect to mongo but then panics (LINK ISSUE HERE).
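
For reference, this is roughly how the option is specified; the URI, credentials and pool sizes in the sketch below are placeholders rather than our production settings:

// Minimal sketch of specifying minPoolSize; the URI and values are
// placeholders, not our production settings.
package main

import (
    "context"
    "time"

    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func connect() (*mongo.Client, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // Equivalent to appending "&minPoolSize=10" to the connection URI.
    opts := options.Client().
        ApplyURI("mongodb+srv://user:pass@example-cluster.mongodb.net/db").
        SetMaxPoolSize(100).
        SetMinPoolSize(10)

    return mongo.Connect(ctx, opts)
}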

Here you can see the case I'm trying to describe. All of the time seems to be spent in sha1.blockAMD64:

A pprof taken from one of the other pods that didn't have any issues at the same time:

 

It doesn't seem like I can attach the pprof profiles here, but if you need access to them I can email them or send them over direct message.

 

 

We are connecting to mongo in the following way:

https://gist.github.com/BradErz/a947198bddf537532190fdb5ea015af3#file-mongoconnection-go
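
As a rough, hypothetical sketch of what a wrapper of that shape looks like (the URI, environment variables and names below are placeholders, not the actual gist contents):

// Hypothetical sketch only - not the actual gist contents.
// GetDatabase builds a client via GetClient and returns a database handle.
package mongoclient

import (
    "context"
    "os"
    "time"

    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

// Database aliases the driver type so callers only import this package.
type Database = mongo.Database

func GetClient(ctx context.Context) (*mongo.Client, error) {
    opts := options.Client().
        ApplyURI(os.Getenv("MONGO_URI")). // placeholder for real config
        SetMaxPoolSize(100)
    return mongo.Connect(ctx, opts)
}

func GetDatabase() (*Database, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    client, err := GetClient(ctx)
    if err != nil {
        return nil, err
    }
    return client.Database(os.Getenv("MONGO_DATABASE")), nil
}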

 

This is the code handling the query:

https://gist.github.com/BradErz/a947198bddf537532190fdb5ea015af3#file-persistence-go

 

If we can help by providing any more information or debugging, please let me know.

 



 Comments   
Comment by Divjot Arora (Inactive) [ 23/Oct/19 ]

Hi b.wilsonhunt@faceit.com,

Our recommendation is that you use the min pool size option because that will allow connections to be authenticated in the background. This should resolve the issue of new connection creation taking a long time. For now, I'm going to close this ticket. Feel free to re-open or open another ticket if you run into any other issues.

– Divjot

Comment by Bradley Wilson-Hunt [ 16/Sep/19 ]

Sorry for taking so long to get back to this.

 

I pushed some code that is able to replicate the issue:

https://github.com/BradErz/test-mongo

 

If something else is needed please let me know.

 

Kind Regards,

Bradley

Comment by Bradley Wilson-Hunt [ 05/Sep/19 ]

We are calling `GetDatabase` once during initialization and then adding that to a struct in the persistence layer.

 

type service struct {
   log *logger.StandardLogger
   db  *mongoclient.Database
}
 
func New(log *logger.StandardLogger) (Database, error) {
   client, err := mongoclient.GetDatabase()
   if err != nil {
      log.WithError(err).Fatal("could not connect to mongoDB")
      return nil, err
   }
 
   return &service{
      log,
      client,
   }, nil
}
 
func (s service) GetAllMMs(ctx context.Context, filter map[string]interface{}) ([]matchmaking.MM, error) {
   ...
   cursor, err := s.db.Collection(mmCol).Find(ctx, filter)
   ...
}

 

The `mongoclient` pkg you see there is just the code I pasted before in the example with GetDatabase. We just wrap the library for convenience.

All our applications use only one database, and from that mongoclient.Database they use different collections.

Hope that makes sense.

 

Comment by Bradley Wilson-Hunt [ 05/Sep/19 ]

Also, it's worth noting that some of our other applications are already using this driver in production and suffer from sporadic issues where the connection pool is starved, needs to open a new connection, and that pod then becomes useless for a while.

As you can see here, all pods were receiving the same RPS, but all of a sudden one of the pods went crazy with CPU, latency and errors.

 

 

When collecting the pprof we see it's always blocking at that point.

The error won't go away on its own; the pod needs to be killed.

 

Comment by David Golden [ 05/Sep/19 ]

I'll let the driver team respond further about the driver pool details.

I do see in your example code that you have GetDatabase calling GetClient. If you aren't persisting the database objects or if you are creating lots of databases, then each of those is going to have a Client that needs to connect the first time it's used. Assuming that you have the same credentials and client options across requests, I would generally expect an app to have a single Client object persisted somewhere and construct Database objects from it as needed, so that all your Database objects are reusing the same connection pool. If you do that, you can initialize your Client at start up and warm up a pool (until the min pool size option is fixed) so that you aren't trying to make connections while servicing requests.
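
As a minimal sketch of that arrangement, with the URI, database name, pool size and warm-up count as placeholder values rather than anything from this project:

// Sketch: one Client per process, Database handles derived from it, and the
// pool warmed up at startup with a handful of concurrent Pings.
package main

import (
    "context"
    "sync"
    "time"

    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
    "go.mongodb.org/mongo-driver/mongo/readpref"
)

// warmUp runs n Pings concurrently; each Ping that can't find an idle
// connection should force the pool to establish (and authenticate) a new one
// before real traffic arrives.
func warmUp(ctx context.Context, client *mongo.Client, n int) {
    var wg sync.WaitGroup
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            _ = client.Ping(ctx, readpref.Primary())
        }()
    }
    wg.Wait()
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    client, err := mongo.Connect(ctx,
        options.Client().ApplyURI("mongodb+srv://user:pass@example-cluster.mongodb.net").SetMaxPoolSize(100))
    if err != nil {
        panic(err)
    }
    defer client.Disconnect(context.Background())

    warmUp(ctx, client, 10)

    // All request handlers share this one Database (and therefore one pool).
    db := client.Database("app")
    _ = db
}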

Comment by Bradley Wilson-Hunt [ 05/Sep/19 ]

Hi David!

Thanks for replying, that does make sense. Are connections only opened within the driver when needed, or does it always keep a buffer? And how does that compare to mgo's behaviour when dealing with connections?

Just as an example, this is what happens when we roll out the official mongo driver in place of the mgo one. The issue doesn't seem to fix itself either; I guess because the context is cancelled due to the timeout from our api-gateway, the driver never successfully opens a new connection to the database, so we are always at the minimum, which from what I understand from the driver code is 0 (maybe 1 or 2 connections are opened, but that's obviously not enough to deal with that load).

We didn't experience any behaviour like this with the mgo driver.

Do you have any recommendations going forward for this?

 

Kind Regards,

Bradley

Comment by David Golden [ 04/Sep/19 ]

Hi.  I'm xdg on Github.  Just for context, the PBKDF2 iterations are an intentional work factor to increase computational costs, just like with bcrypt, to make brute force attacks harder.  So seeing delays there is by design and is also affected by the scramSHA256IterationCount setting (or scramIterationCount, if you're using SCRAM-SHA-1).
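
If you want to check what your deployment is using, the iteration counts can be read with getParameter; a minimal sketch, assuming the connected user is allowed to run that command (managed or shared tiers may restrict it):

// Sketch: read the server's SCRAM iteration counts via getParameter. The
// connected user needs permission to run the command.
package main

import (
    "context"
    "fmt"
    "time"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
    if err != nil {
        panic(err)
    }
    defer client.Disconnect(context.Background())

    cmd := bson.D{
        {Key: "getParameter", Value: 1},
        {Key: "scramIterationCount", Value: 1},
        {Key: "scramSHA256IterationCount", Value: 1},
    }
    var result bson.M
    if err := client.Database("admin").RunCommand(ctx, cmd).Decode(&result); err != nil {
        panic(err)
    }
    fmt.Println(result)
}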

Comment by Bradley Wilson-Hunt [ 04/Sep/19 ]

It seems the underlying library this is coming from is https://github.com/xdg-go/scram/blob/d4e428aa08f9fb604bbbe12dc0f1c842f9d7443e/client.go#L108

Comment by Bradley Wilson-Hunt [ 04/Sep/19 ]

I can't seem to edit the post, so the (LINK ISSUE HERE) links will be here instead:

Panic in current master (22646d953d8106e567b1da9aab98b627a2fb204f): 

https://jira.mongodb.org/browse/GODRIVER-1298

 

1.1.0 doesn't start if minPoolSize is specified:

https://jira.mongodb.org/browse/GODRIVER-1299
