[GODRIVER-1613] fix how Pool handles MaxPoolSize  Created: 10/May/20  Updated: 28/Oct/23  Resolved: 05/Jun/20

| Status: | Closed |
| Project: | Go Driver |
| Component/s: | None |
| Affects Version/s: | 1.3.2 |
| Fix Version/s: | 1.3.5 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Gaston Gregorio Blas Acevedo | Assignee: | Isabella Siu (Inactive) |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | MacBook Pro with macOS Catalina, Version 10.15.4 (19E287) |
| Attachments: | |
| Issue Links: | |
| Description |
Under heavy load, using a context with a timeout on a find operation, hitting the timeout leads to odd behavior in the driver's connection pool. To be clear, this is a snippet of the code tested:

where:

Two files attached:
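As a rough illustration of the scenario described (this is not the reporter's original snippet, which was not preserved here), the following minimal sketch shows a Find call driven by a context timeout against a client with a bounded pool. The URI, database/collection names, pool bounds, and the 50ms timeout are placeholder assumptions.

{code:go}
package main

import (
	"context"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	// Placeholder URI and pool bounds; the real values come from the test project.
	client, err := mongo.Connect(context.Background(),
		options.Client().
			ApplyURI("mongodb://localhost:27017").
			SetMinPoolSize(10).
			SetMaxPoolSize(100))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(context.Background())

	coll := client.Database("test").Collection("items")

	// Each request runs under a hard deadline; when it expires, the driver
	// returns a context error and discards the in-use connection.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	cur, err := coll.Find(ctx, bson.D{})
	if err != nil {
		log.Println("find failed:", err)
		return
	}
	defer cur.Close(context.Background())
}
{code}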
| Comments |
| Comment by Githook User [ 11/Jun/20 ] |
Author: Divjot Arora <divjot.arora@10gen.com> (divjotarora)
Message:
| Comment by Githook User [ 11/Jun/20 ] |
Author: Divjot Arora <divjot.arora@10gen.com> (divjotarora)
Message:
| Comment by Githook User [ 05/Jun/20 ] |
Author: iwysiu <isabella.siu@10gen.com> (iwysiu)
Message:
| Comment by Githook User [ 05/Jun/20 ] |
Author: iwysiu <isabella.siu@10gen.com> (iwysiu)
Message:
| Comment by Isabella Siu (Inactive) [ 04/Jun/20 ] |
We're trying to finish this ticket up by tomorrow morning. The fix in
| Comment by Gaston Gregorio Blas Acevedo [ 04/Jun/20 ] |
Hi Isabella, can you please give me some details about the bug in review? TY!
| Comment by Isabella Siu (Inactive) [ 28/May/20 ] |
As an update, one of the two bugs has been fixed and the other one is currently going through review.
| Comment by Gaston Gregorio Blas Acevedo [ 27/May/20 ] |
Hi Isabella, any update on this topic? TY!
| Comment by Isabella Siu (Inactive) [ 21/May/20 ] |
As an update, we're currently working on bug fixes that should reduce the number of opened and closed connections. Thanks for all the information!
| Comment by Gaston Gregorio Blas Acevedo [ 20/May/20 ] |
> We've found a potential bug with MaxPoolSize that we are currently investigating. However, regarding the timeouts, an expired context timeout must close the socket and we can't change that behavior.

I don't feel in a position to ask you for a change like not closing the socket on context timeout.

> You mention that your mongod/mongos servers don't show atypical resource consumption

Our monitoring shows the servers' resource consumption in the expected range for the system's load, and our timeout is inflexible; we can't use a best-effort timeout like FindOptions' MaxTime, so we use the context to end any call that reaches the maximum time we can wait without compromising the SLA.

> but the next sentence says that the servers' load pushes response times until the context expires. If the mongod/mongos server has ample resources, can you provide any guesses as to why it's not responding within the specified MaxTime option?

In peak hours our maximum timeout is close to the average MongoDB response time.
| Comment by Isabella Siu (Inactive) [ 19/May/20 ] |
We've found a potential bug with MaxPoolSize that we are currently investigating. However, regarding the timeouts, an expired context timeout must close the socket and we can't change that behavior. Can you provide more explanation about your previous comment? You mention that your mongod/mongos servers don't show atypical resource consumption, but the next sentence says that the servers' load pushes response times until the context expires. If the mongod/mongos server has ample resources, can you provide any guesses as to why it's not responding within the specified MaxTime option?
| Comment by Gaston Gregorio Blas Acevedo [ 19/May/20 ] |
Hi Isabella, Thank you in advance
| Comment by Gaston Gregorio Blas Acevedo [ 14/May/20 ] |
Monitoring our servers doesn't show atypical resource consumption; in fact it never goes over 30% or 40% of the available memory or CPU. But we have a hard constraint on response time and must honor it, so our timeouts are inflexible.
| Comment by Divjot Arora (Inactive) [ 14/May/20 ] |
You're setting both but still seeing connection churn? Is it possible that the server is under too much load to respond to requests in a timely manner and by the time it can send a response, the context timeout has expired so the driver has closed the connection?
| Comment by Gaston Gregorio Blas Acevedo [ 14/May/20 ] |
Hi Divjot, Thanks!
| Comment by Divjot Arora (Inactive) [ 14/May/20 ] |
Thank you for all of the information! One thing to note is that the server can sometimes report more connections than the driver actually has open if it is under load and has not been able to prune the connections that were closed by the driver. Of course, this doesn't mean there isn't also a bug with MaxPoolSize in the driver. Our plan is to investigate this further and see if we can find cases that cause the driver to create more connections than MaxPoolSize. I'm hoping to get an answer to you early next week.

In the meantime, you mentioned that MaxTime works OK for your use case. Can you try setting both MaxTime and a context timeout? I'd recommend setting MaxTime to something slightly shorter than the context timeout so the server has enough time to send a response before the driver receives a network timeout error and closes the connection.
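A minimal sketch of that suggestion, assuming an illustrative 1-second budget: MaxTime bounds the server-side execution and is set slightly below the context deadline that bounds the whole round trip. The 900ms/1s values and the helper name are placeholders, not values from this ticket.

{code:go}
package example

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// findWithBothTimeouts limits server-side execution with MaxTime and the whole
// call with a slightly longer context deadline, as suggested in the comment above.
func findWithBothTimeouts(coll *mongo.Collection) error {
	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancel()

	cur, err := coll.Find(ctx, bson.D{},
		options.Find().SetMaxTime(900*time.Millisecond))
	if err != nil {
		return err
	}
	defer cur.Close(context.Background())

	for cur.Next(ctx) {
		// decode cur.Current here
	}
	return cur.Err()
}
{code}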
| Comment by Gaston Gregorio Blas Acevedo [ 13/May/20 ] |
The reported mongostat is connected to a stand-alone MongoDB instance running on my machine for testing, so minPoolSize/maxPoolSize is absolute, I guess. I gave you the information about our production clusters just to show you the heterogeneous topologies where the same problem shows up. I have a Go project to test this at: https://github.com/n4d13/mongo_driver_test
| Comment by Divjot Arora (Inactive) [ 12/May/20 ] |
The pool size applies to each node, not the total number of connections to the cluster overall. That means if you have a max pool size of 100 and a 5-node cluster, the driver will create up to 500 application connections total. This will actually be something like 505 connections reported by mongostat because we also keep one dedicated connection per node that's used for monitoring and detecting state changes. I might be misunderstanding something from your response, but we need to know if the mongostat connection is to a single node or to the cluster as a whole, as this will influence the number of connections reported by the tool. See https://docs.mongodb.com/manual/reference/program/mongostat/#cmdoption-mongostat-host for some more information about this. Can you provide this information?
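A back-of-the-envelope check of that arithmetic, assuming the hypothetical 5-node cluster and pool size of 100 from the comment above; the numbers are illustrative, not measurements from this ticket.

{code:go}
package main

import "fmt"

func main() {
	// MaxPoolSize is enforced per node; the driver also keeps one extra
	// monitoring connection per node.
	const nodes, maxPoolSize = 5, 100
	total := nodes*maxPoolSize + nodes
	fmt.Println(total) // 505 connections visible cluster-wide
}
{code}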
| Comment by Gaston Gregorio Blas Acevedo [ 12/May/20 ] |
Hi Divjot,

The topology for MongoDB in production is:

The topology for DocumentDB in production is:

Locally, I'm running with these specifications:

That mongostat output, the logs, and the test results come from running the tests locally. Thanks in advance!
| Comment by Divjot Arora (Inactive) [ 12/May/20 ] |
How many nodes are in the cluster? Is mongostat connected directly to a mongod/mongos, or is it making a replica set connection, where it auto-discovers the nodes?
| Comment by Gaston Gregorio Blas Acevedo [ 12/May/20 ] |
Hi Divjot, thank you for your response. At the bottom I pasted the mongostat output because Jira won't let me attach another file. Sorry.
| Comment by Divjot Arora (Inactive) [ 11/May/20 ] |
Thank you for the detailed report. We don't think this is a bug. The driver implements context timeouts by translating the context deadline to a socket timeout using the SetReadDeadline and SetWriteDeadline functions on the net.Conn interface. If the read or write gets a socket timeout error, we have to close the connection for two reasons:

As you mentioned, the MaxTime option would be better for this use case, as it only limits the amount of time the server spends on the operation and does not cause the socket to be closed if the operation fails. Note that you may want to use both MaxTime and a context timeout (or the SocketTimeout Client option) because MaxTime is best-effort. The socket timeout should be an upper bound: if the server hasn't replied by then, we can close the connection.

We do have an ongoing drivers-wide project to consolidate the many driver timeout options. The project is still being scoped/designed, but part of it may involve using the context timeout or other timeouts provided through various options to set maxTimeMS on commands we send to the server to avoid this kind of connection churn.

Does this clarify what's going on in the driver and why MaxTime works for your use case?

– Divjot
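As a general-purpose illustration of the deadline translation described above (a sketch of the pattern, not the driver's internal code), the context deadline can be copied onto the socket with SetReadDeadline, and a timed-out read forces the connection to be discarded because the next bytes on the wire could belong to a stale reply.

{code:go}
package example

import (
	"context"
	"errors"
	"net"
	"os"
)

// readWithContext sketches how a context deadline can be mapped onto a socket
// read deadline. On timeout the connection is closed: the server may still
// send its reply later, and we can no longer tell where the next message
// boundary is, so the connection cannot be safely reused.
func readWithContext(ctx context.Context, conn net.Conn, buf []byte) (int, error) {
	if deadline, ok := ctx.Deadline(); ok {
		if err := conn.SetReadDeadline(deadline); err != nil {
			return 0, err
		}
	}

	n, err := conn.Read(buf)
	if err != nil && errors.Is(err, os.ErrDeadlineExceeded) {
		conn.Close()
	}
	return n, err
}
{code}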