[GODRIVER-1613] fix how Pool handles MaxPoolSize Created: 10/May/20  Updated: 28/Oct/23  Resolved: 05/Jun/20

Status: Closed
Project: Go Driver
Component/s: None
Affects Version/s: 1.3.2
Fix Version/s: 1.3.5

Type: Bug Priority: Major - P3
Reporter: Gaston Gregorio Blas Acevedo Assignee: Isabella Siu (Inactive)
Resolution: Fixed Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

MacBook Pro with macOS Catalina, version 10.15.4 (19E287)
processor: 2.2 GHz Quad-Core Intel Core i7
memory: 16 GB 1600 MHz DDR3
Golang version: go1.13.5 darwin/amd64
Database version: 3.6.17 running on Docker for tests


Attachments: Zip Archive Test_result.zip     Zip Archive console.zip    
Issue Links:
Related
related to GODRIVER-1620 pool cleared on network timeout errors Closed

 Description   

Under heavy load, using a context with a timeout on a find operation, reaching the timeout leads to strange behavior in the driver's connection pool.
Connections start to be discarded on timeout and the pool falls into a snowball of discard/reconnect attempts until the driver panics.
In tests running for about 2 minutes we observe, using a pool monitor, about 27k opened and 26k closed connections, with a minPoolSize of 30 and a maxPoolSize of 100.

To be clear, this is a snippet of the code tested:

filter := bson.M{"store_id": bson.M{"$in": idsList}}
ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
defer cancel()
 
fOptions := options.Find().
   SetMaxTime(500*time.Millisecond).
   SetBatchSize(int32(len(idsList)))
 
records, err := m.storesCollection.Find(ctx, filter, fOptions)

where:
idsList: a variable-length slice of ids to search (indexed, unique field)
ctx: the problematic context with timeout
fOptions.MaxTime: works OK
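
For reference, a minimal sketch of the kind of pool monitor used to count opened/closed connections, assuming the go.mongodb.org/mongo-driver event and options packages plus sync/atomic (the counter names and URI are illustrative):

var opened, closed int64

monitor := &event.PoolMonitor{
   Event: func(evt *event.PoolEvent) {
      switch evt.Type {
      case event.ConnectionCreated:
         atomic.AddInt64(&opened, 1) // connection opened by the pool
      case event.ConnectionClosed:
         atomic.AddInt64(&closed, 1) // connection discarded by the pool
      }
   },
}

clientOpts := options.Client().
   ApplyURI("mongodb://localhost:27017").
   SetMinPoolSize(30).
   SetMaxPoolSize(100).
   SetPoolMonitor(monitor)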

Two files attached:
1- Test results with metrics and error messages
2- Console output with traces



 Comments   
Comment by Githook User [ 11/Jun/20 ]

Author:

{'name': 'Divjot Arora', 'email': 'divjot.arora@10gen.com', 'username': 'divjotarora'}

Message: GODRIVER-1613 Improve stability of flaky test (#420)

Comment by Githook User [ 05/Jun/20 ]

Author:

{'name': 'iwysiu', 'email': 'isabella.siu@10gen.com', 'username': 'iwysiu'}

Message: GODRIVER-1613 fix how pool handles MaxPoolSize (#413)
Branch: release/1.3
https://github.com/mongodb/mongo-go-driver/commit/d89ba217da09bec04910dc1c0cead5288021c540

Comment by Githook User [ 05/Jun/20 ]

Author:

{'name': 'iwysiu', 'email': 'isabella.siu@10gen.com', 'username': 'iwysiu'}

Message: GODRIVER-1613 fix how pool handles MaxPoolSize (#413)
Branch: master
https://github.com/mongodb/mongo-go-driver/commit/bbe8c4a0adeb6fc5095d0aa53a7d32a025d368c9

Comment by Isabella Siu (Inactive) [ 04/Jun/20 ]

Hi gaston.acevedo@rappi.com,

We're trying to finish this ticket up by tomorrow morning. The fix in GODRIVER-1620 was released in v1.3.4, and we believe that it should help with your problem.

Comment by Gaston Gregorio Blas Acevedo [ 04/Jun/20 ]

Hi Isabella,

Can you please give me some details about the bug under review?
I can see only one related ticket, already closed: https://jira.mongodb.org/browse/GODRIVER-1620

TY!

Comment by Isabella Siu (Inactive) [ 28/May/20 ]

Hi gaston.acevedo@rappi.com,

As an update, one of the two bugs has been fixed and the other one is currently going through review.

Comment by Gaston Gregorio Blas Acevedo [ 27/May/20 ]

Hi Isabella,

any update on this topic?

TY!

Comment by Isabella Siu (Inactive) [ 21/May/20 ]

Hi gaston.acevedo@rappi.com,

As an update, we're currently working on bug fixes that should reduce the number of opened and closed connections. Thanks for all the information!

Comment by Gaston Gregorio Blas Acevedo [ 20/May/20 ]

> We've found a potential bug with MaxPoolSize that we are currently investigating. However, regarding the timeouts, an expired context timeout must close the socket and we can't change that behavior.

I'm not in a position to ask you for a change like not closing the socket on context timeout.
We need the connection pool to handle the timeout in a gentler and more predictable manner.
As shown in one of my previous comments, when the context timeout is reached, mongostat reports the open connection count jumping second to second between 45, 18, 371, 13, 79, 163, 151, and 308. This behaviour translates directly into microservice auto-scaling (at best) and into our infrastructure alarms firing, both for the misbehaving instance and for the global count of open connections.


> You mention that your mongod/mongos servers don't show atypical resource consumption

Our monitoring shows the servers' resource consumption in the expected range for the system's load. Our timeout is inflexible; we can't rely only on a best-effort timeout like FindOptions' MaxTime, so we use the context to end any call that reaches the maximum time we can wait without compromising the SLA.


> but the next sentence says that the servers' load pushes response times until the context expires. If the mongod/mongos server has ample resources, can you provide any guesses as to why it's not responding within the specified MaxTime option?

In peak hours our maximum timeout is close to MongoDB's average response time.
Sometimes a query exceeds the timeout; that call is dropped and replaced with a response from the service's fallback (probably stale data, but data at least).
The query issued to MongoDB is very simple (find documents by a list of ids using $in), but the length of the ids list varies from roughly one hundred to four hundred on each call (the size depends on the user's context).
The queries with a larger number of documents to return are the most susceptible to reaching our timeout.

Comment by Isabella Siu (Inactive) [ 19/May/20 ]

Hi gaston.acevedo@rappi.com,

We've found a potential bug with MaxPoolSize that we are currently investigating. However, regarding the timeouts, an expired context timeout must close the socket and we can't change that behavior.

Can you provide more explanation about your previous comment? You mention that your mongod/mongos servers don't show atypical resource consumption, but the next sentence says that the servers' load pushes response times until the context expires. If the mongod/mongos server has ample resources, can you provide any guesses as to why it's not responding within the specified MaxTime option?

Comment by Gaston Gregorio Blas Acevedo [ 19/May/20 ]

Hi Isabella,
Please, can you tell me the next steps?
We have high expectations for this.

Thank you in advance

Comment by Gaston Gregorio Blas Acevedo [ 14/May/20 ]

Monitoring our servers doesn't show atypical resource consumption; in fact, it never goes over 30% or 40% of the available memory or CPU. But we have a hard constraint on response time and must honor it, so our timeouts are inflexible.
The problem arises when the servers' load pushes response times up to our timeouts and the described behaviour starts:

  • Continuous recycling of connections
  • Checkouts from the connection pool getting stuck
  • A large number of connections trying to authenticate at the same time
  • Application code failing, and so on

Comment by Divjot Arora (Inactive) [ 14/May/20 ]

You're setting both but still seeing connection churn? Is it possible that the server is under too much load to respond to requests in a timely manner and by the time it can send a response, the context timeout has expired so the driver has closed the connection?

Comment by Gaston Gregorio Blas Acevedo [ 14/May/20 ]

Hi Divjot,
We already set both MaxTime in the find options and a context timeout, with the context timeout longer than MaxTime. Additionally, we were forced to scale up our microservice instances to spread out the timeouts and try to keep things running in a barely acceptable way, but clearly this is not the way to go.

Thanks! 

Comment by Divjot Arora (Inactive) [ 14/May/20 ]

Thank you for all of the information! One thing to note is that the server can sometimes report more connections than the driver actually has open if it is under load and has not been able to prune the connections that were closed by the driver. Of course, this doesn't mean there isn't also a bug with MaxPoolSize in the driver.

Our plan is to investigate this further and see if we can find cases that cause the driver to create more connections than MaxPoolSize. I'm hoping to get an answer to you early next week. In the meantime, you mentioned that MaxTime works OK for your use case. Can you try setting both MaxTime and a context timeout? I'd recommend setting MaxTime to something slightly shorter than the context timeout so the server has enough time to send a response before the driver receives a network timeout error and closes the connection.
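
For illustration, a sketch of that suggestion applied to the snippet from the description (m.storesCollection, filter, and idsList are the reporter's variables; the 450ms value is just an example of "slightly shorter" than the 500ms context deadline):

ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
defer cancel()

fOptions := options.Find().
   SetMaxTime(450*time.Millisecond). // give the server a chance to reply before the 500ms socket deadline
   SetBatchSize(int32(len(idsList)))

records, err := m.storesCollection.Find(ctx, filter, fOptions)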

Comment by Gaston Gregorio Blas Acevedo [ 13/May/20 ]

The mongostat report is from a connection to a standalone MongoDB instance running on my machine for the tests, so minPoolSize/maxPoolSize is absolute, I guess.

I gave you the information about our production clusters just to show the heterogeneous topologies where the same problem shows up.

I have a Go project to test this at: https://github.com/n4d13/mongo_driver_test
where I reproduce the usage scenario.
Feel free to use it.

Comment by Divjot Arora (Inactive) [ 12/May/20 ]

The pool size applies to each node, not to the total number of connections to the cluster overall. That means if you have a max pool size of 100 and a 5-node cluster, the driver will create up to 500 application connections in total. This will actually show up as something like 505 connections reported by mongostat because we also keep one dedicated connection per node that's used for monitoring and detecting state changes.

I might be misunderstanding something from your response, but we need to know if the mongostat connection is to a single node or to the cluster as a whole, as this will influence the number of connections reported by the tool. See https://docs.mongodb.com/manual/reference/program/mongostat/#cmdoption-mongostat-host for some more information about this.

Can you provide this information?

Comment by Gaston Gregorio Blas Acevedo [ 12/May/20 ]

Hi Divjot, 
We have 9 production deployments with MongoDB 3.6 in AWS, 9 production deployments using DocumentDB in AWS (certified compatible with MongoDB 3.6), and a test running locally on my machine. All of them show the same behaviour.

The topology for MongoDB in production is:
Replica set of 5 nodes - r5.4xlarge (4 cores, 32 GB RAM, 10 Gbps network, 4750 Mbps storage bandwidth)

The topology for DocumentDB in production is:
Replica set of 5 nodes - db.m5.8xlarge for the writer and 4 db.m5.xlarge for readers

Locally, I'm running with these specifications:
MacBook Pro with macOS Catalina, version 10.15.4 (19E287)
Processor: 2.2 GHz Quad-Core Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Golang version: go1.13.5 darwin/amd64
Database version: 3.6.17 running on Docker for tests

The mongostat output, logs, and test results come from running the tests locally.
To help, I can share the test project developed to investigate this case; just tell me.

Thanks in advance!

Comment by Divjot Arora (Inactive) [ 12/May/20 ]

How many nodes are in the cluster? Is mongostat connected directly to a mongod/mongos, or is it making a replica set connection where it auto-discovers the nodes?

Comment by Gaston Gregorio Blas Acevedo [ 12/May/20 ]

Hi Divjot, thank you for your response.
Just to clarify our scenario: we have a hard time constraint, and a best-effort timeout translates to, for example, opened circuit breakers, fallbacks being called, increased response times, and so on.
Maybe some timeouts wouldn't hurt, but when we reach this timeout scenario, we suddenly face the connection discard/reconnect mechanics, with mongostat showing erratic behaviour.
With a pool configured with min = 30 and max = 100 connections, mongostat reports, second to second, open connection counts of 18, 371, 13, 79, 163, 151, 308... and then the process died.

At the bottom I pasted the mongostat output because Jira won't let me attach another file. Sorry.

 

insert query update delete getmore command dirty used flushes vsize   res qrw  arw net_in net_out conn                time
    *0    24     *0     *0       0   594|0  0.0% 1.8%       0 1.21G  118M 0|0  1|0   390k   3.53m   23 May 12 01:32:53.069
    *0    30     *0     *0       0   708|0  0.0% 1.8%       0 1.20G  118M 0|0  1|0   473k   4.08m   39 May 12 01:32:54.139
    *0    39     *0     *0       5   716|0  0.0% 1.8%       0 1.20G  118M 0|0  1|0   544k   4.93m   14 May 12 01:32:55.075
    *0    45     *0     *0       5   737|0  0.0% 1.8%       0 1.20G  118M 0|0  4|0   668k   5.86m   22 May 12 01:32:56.036
    *0    52     *0     *0       2   683|0  0.0% 1.8%       0 1.21G  118M 0|0  1|0   730k   7.76m   45 May 12 01:32:57.056
    *0    37     *0     *0       6   720|0  0.0% 1.8%       0 1.20G  118M 0|0  3|0   543k   4.65m   18 May 12 01:32:58.040
    *0     9     *0     *0       0   428|0  0.0% 1.8%       0 1.52G  121M 0|0  1|0   199k   1.65m  371 May 12 01:32:59.118
    *0     6     *0     *0       0   164|0  0.0% 1.8%       0 1.20G  118M 0|0  1|0   109k   1.00m   13 May 12 01:33:00.104
    *0    67     *0     *0       0   471|0  0.0% 1.8%       0 1.24G  119M 0|0  1|0   841k   9.15m   79 May 12 01:33:01.029
    *0    63     *0     *0       2   334|0  0.0% 1.8%       0 1.33G  120M 0|0  3|0   725k   7.76m  163 May 12 01:33:02.073
insert query update delete getmore command dirty used flushes vsize   res qrw  arw net_in net_out conn                time
    *0    21     *0     *0       0   534|0  0.0% 1.8%       0 1.31G  119M 0|0  1|0   326k   3.09m  151 May 12 01:33:02.995
    *0    11     *0     *0       0   414|0  0.0% 1.8%       0 1.48G  122M 0|0  1|0   229k   1.81m  308 May 12 01:33:04.229
    *0    *0     *0     *0       0   448|0  0.0% 1.8%       0 1.20G  118M 0|0  1|0  98.8k    187k    2 May 12 01:33:04.996
    *0    *0     *0     *0       0   134|0  0.0% 1.8%       0 1.20G  118M 0|0  1|0  29.5k   93.2k    2 May 12 01:33:05.996
    *0    *0     *0     *0       0     1|0  0.0% 1.8%       0 1.20G  118M 0|0  1|0   157b   61.1k    2 May 12 01:33:07.002
    *0    *0     *0     *0       0    44|0  0.0% 1.8%       0 1.20G  118M 0|0  1|0  9.47k   71.7k    2 May 12 01:33:07.998
    *0    *0     *0     *0       0     2|0  0.0% 1.8%       0 1.20G  118M 0|0  1|0   158b   61.5k    2 May 12 01:33:08.997
    *0    *0     *0     *0       0     1|0  0.0% 1.8%       0 1.20G  118M 0|0  1|0   157b   61.3k    2 May 12 01:33:09.999
    *0    *0     *0     *0       0    75|0  0.0% 1.8%       0 1.20G  118M 0|0  1|0  16.3k   79.1k    2 May 12 01:33:10.996
    *0    *0     *0     *0       0     2|0  0.0% 1.8%       0 1.20G  118M 0|0  1|0   158b   61.4k    2 May 12 01:33:11.996

 

Comment by Divjot Arora (Inactive) [ 11/May/20 ]

Hi gaston.acevedo@rappi.com,

Thank you for the detailed report. We don't think this is a bug. The driver implements context timeouts by translating the context deadline to a socket timeout using the SetReadDeadline and SetWriteDeadline functions on the net.Conn interface. If the read or write gets a socket timeout error, we have to close the connection for two reasons:

  1. To let the server know that it can stop doing the work for the operation. The server polls its open connections and can interrupt in-progress work at designated points if the client has closed the connection.
  2. To free up system resources that may be storing part or all of the server response. If the timeout happens after the server has already started writing the response, we don't know the state of the socket. We can't reuse it because the server may already have written the response, so future reads would read responses from previous requests.
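
For illustration, a minimal sketch of how a context deadline can be applied to a socket read (this is not the driver's internal code; the helper name is illustrative and assumes a plain net.Conn):

func readWithContextDeadline(ctx context.Context, conn net.Conn, buf []byte) (int, error) {
   if deadline, ok := ctx.Deadline(); ok {
      // Translate the context deadline into a socket read deadline.
      if err := conn.SetReadDeadline(deadline); err != nil {
         return 0, err
      }
   }
   n, err := conn.Read(buf)
   if err != nil {
      // On a read error (including a deadline timeout) the state of the socket
      // is unknown, so the connection is closed instead of being reused.
      conn.Close()
   }
   return n, err
}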

As you mentioned, the MaxTime option would be better for this use case, as it only limits the amount of time the server spends on the operation and does not cause the socket to be closed if the operation fails. Note that you may want to use both MaxTime and a context timeout (or the SocketTimeout client option) because MaxTime is best-effort. The socket timeout should be an upper bound: if the server hasn't replied by then, we can close the connection.
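
For illustration, a sketch of setting that upper bound through the SocketTimeout client option, assuming the standard mongo and options packages (the URI and the 2-second value are only examples; MaxTime and/or the context deadline should normally expire first):

clientOpts := options.Client().
   ApplyURI("mongodb://localhost:27017").
   SetSocketTimeout(2 * time.Second) // upper-bound network timeout
client, err := mongo.Connect(context.Background(), clientOpts)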

We do have an ongoing drivers-wide project to consolidate the many driver timeout options. The project is still being scoped/designed, but part of it may involve using the context timeout or other timeouts provided through various options to set maxTimeMS on the commands we send to the server, to avoid this kind of connection churn.

Does this clarify what's going on in the driver and why MaxTime works for your use case?

– Divjot

Generated at Thu Feb 08 08:36:46 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.