[JAVA-644] SocketException causes cleanup of the entire DBPortPool leading to OutOfMemoryErrors in high load applications Created: 18/Sep/12  Updated: 04/Dec/13  Resolved: 04/Dec/13

Status: Closed
Project: Java Driver
Component/s: Connection Management
Affects Version/s: 2.8.0, 2.9.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Robert Gacki Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Oracle JVM, Linux 64bit



 Description   

Hi,

I have an application under high load connected to MongoDB (2.0.7) using the Java driver. When the DB fails to allocate memory (mmap), the connection on the client side is reset. The SocketException that is thrown then causes a cleanup of the entire pool of socket connections in the DBPortPool implementation. Is there a reason for that?

Cleaning up the entire pool leads to OutOfMemoryErrors, since connection (de)allocation is expensive. As a result, the failure of a single DB operation can affect the entire client application.

Would it not be better to clean up just the connection that was reset?
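For illustration, here is a rough sketch of what I have in mind. The pool type and method names below are my own placeholders to contrast the two cleanup strategies, not the actual DBPortPool code:

    import java.io.IOException;
    import java.net.Socket;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Hypothetical pool, only to contrast the two strategies; SimplePool,
    // gotErrorClearAll and gotErrorEvictOne are not the driver's API.
    class SimplePool {
        private final Queue<Socket> available = new ConcurrentLinkedQueue<Socket>();

        // Behaviour as I observe it: one SocketException clears everything.
        void gotErrorClearAll(Socket failed) {
            closeQuietly(failed);
            Socket s;
            while ((s = available.poll()) != null) {
                closeQuietly(s);            // every pooled connection is discarded
            }
        }

        // Suggested behaviour: discard only the connection that saw the error.
        void gotErrorEvictOne(Socket failed) {
            available.remove(failed);
            closeQuietly(failed);           // the rest of the pool stays warm
        }

        private static void closeQuietly(Socket s) {
            try { s.close(); } catch (IOException e) { /* ignore */ }
        }
    }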

Thanks



 Comments   
Comment by Jeffrey Yemin [ 04/Dec/13 ]

Thanks for the reply. Closing.

Jeff

Comment by Robert Gacki [ 04/Dec/13 ]

Hi Jeff,
no, I did not run any further load tests to reproduce the issue, and we did not see it in production. You can close the ticket.

Comment by Jeffrey Yemin [ 04/Dec/13 ]

Hi Robert,

Do you have any updates? If not, I'm going to close this ticket.

Comment by Robert Gacki [ 19/Jul/13 ]

Hi Jeff,

I'm not sure either, but it was reproducible at that time. I suspect the overhead of creating new connections. Does the driver exchange data first when it connects? Maybe it's the buffers filling up all at once when the HTTP requests hit the application.

Anyway, my project has planned to run load tests again. I will bring this up so we can analyse it further.

Best,
Robert

Comment by Jeffrey Yemin [ 12/Jul/13 ]

Robert,

Can you explain how controlling the rate of connection allocation will result in fewer live objects on the heap in the end? It seems to me that we will ultimately end up with the same live heap, and you either have enough memory to hold it or you don't.

Comment by Robert Gacki [ 25/Jun/13 ]

1. Xmx was set to 3g.
2. I'm not sure anymore, but I think we had between 500 and 1,000 connections per host. Our current configurations use 500, though.

A common solution is to let the pool warm up by configuring a minimum pool size. The pool then allocates new connections at a controlled rate (in a separate thread, with a configurable rate) instead of being populated by a herd of requests, and it becomes available once that allocation is finished. In my case, I could set the minimum size to match the pool's maximum size. It's better for me to wait a few seconds longer for the application to become available than to have it fail with an OOM and have the JVM restarted.
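To sketch what I mean by a controlled warm-up (purely illustrative; the class, the Factory interface, and the rate parameter are my own placeholders, not a driver API):

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Illustrative warm-up: a background thread pre-fills the pool at a fixed
    // rate, and callers wait until the configured minimum size is reached.
    class WarmedPool<C> {
        interface Factory<T> { T create(); }

        private final Queue<C> pool = new ConcurrentLinkedQueue<C>();
        private final CountDownLatch ready;

        WarmedPool(final Factory<C> factory, int minSize, long connectionsPerSecond) {
            this.ready = new CountDownLatch(minSize);
            final ScheduledExecutorService warmer = Executors.newSingleThreadScheduledExecutor();
            long periodMillis = Math.max(1, 1000 / connectionsPerSecond);
            warmer.scheduleAtFixedRate(new Runnable() {
                public void run() {
                    if (ready.getCount() == 0) {
                        warmer.shutdown();          // minimum size reached, stop warming
                        return;
                    }
                    pool.add(factory.create());     // allocate one connection per tick
                    ready.countDown();
                }
            }, 0, periodMillis, TimeUnit.MILLISECONDS);
        }

        // Blocks until the pool has warmed up to its minimum size.
        void awaitReady() throws InterruptedException {
            ready.await();
        }
    }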

Comment by Jeffrey Yemin [ 25/Jun/13 ]

A few questions:

  1. What have you set -Xmx to for the JVM?
  2. What have you set connectionsPerHost to in MongoClientOptions?

If the primary goes down, all the connections will be dropped by the server, so the connection pool needs to be cleared in any case. Do you have an alternative to suggest?
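For reference, connectionsPerHost is set like this (placeholder values only, assuming a 2.10+ driver where MongoClientOptions is available):

    import com.mongodb.MongoClient;
    import com.mongodb.MongoClientOptions;

    public class PoolConfigExample {
        public static void main(String[] args) throws Exception {
            // Placeholder values; tune connectionsPerHost together with -Xmx,
            // since each pooled connection carries its own buffers.
            MongoClientOptions options = MongoClientOptions.builder()
                    .connectionsPerHost(500)
                    .threadsAllowedToBlockForConnectionMultiplier(5)
                    .build();

            MongoClient client = new MongoClient("localhost", options);
            try {
                System.out.println(client.getDatabaseNames());
            } finally {
                client.close();
            }
        }
    }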

Comment by Robert Gacki [ 25/Jun/13 ]

Hi,

at that time, we ran load tests, and in that scenario we simulated an outage of the master. When that happened, the #gotError method of the DBPortPool class received either a SocketException or an EOFException (BSON) and all connections of the pool were closed. The OOM occurred while the load was still high and the pool was being repopulated with new connections after the (new) master became available again.

From my POV, the driver should not cause OOMs when there is a connectivity problem, even under high load, because fail-over is a feature of the driver. So I questioned the strategy of dumping the entire pool.

Of course, if there is another way to mitigate the problem, I'd appreciate any hints / best practices.

Best,
Robert

Comment by Jeffrey Yemin [ 25/Jun/13 ]

It's not clear to me how closing a connection, even if an expensive operation, would lead to OOM. Can you elaborate?
