[JAVA-3690] Domain name resolution issues break DefaultConnectionPool when using getAsync Created: 09/Apr/20  Updated: 28/Oct/23  Resolved: 30/Apr/20

Status: Closed
Project: Java Driver
Component/s: Async, Connection Management
Affects Version/s: 3.9.0, 4.0.0
Fix Version/s: 4.1.0

Type: Bug Priority: Major - P3
Reporter: Metod Medja Assignee: John Stewart (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by JAVA-3761 ConcurrentPool doesn't release semaph... Closed

 Description   

I've recently experienced intermittent DNS resolution issues while working on a project, and these issues eventually resulted in all future queries failing with "Timeout waiting for a pooled item ...". Restarting the program helped, but the problem would resurface after enough resolution failures.

Now, this is technically a Scala application using the Scala driver, but I was able to pinpoint the problem. Invoking getAsync(SingleResultCallback<InternalConnection>) on DefaultConnectionPool invokes openAsync to open the pooled connection if it isn't already open. When the connection is backed by an AsynchronousSocketChannelStream or NettyStream, that in turn invokes the stream's openAsync(AsyncCompletionHandler<Void>), which executes serverAddress.getSocketAddresses(). It appears that openAsync in DefaultConnectionPool only expects exceptions via its callback, but exceptions thrown from serverAddress.getSocketAddresses() are propagated all the way back to it.
Now at least in the Scala driver the exception is caught by an ErrorHandlingResultCallback eventually, which stops the exception. But nothing releases the connection back to the pool. This eventually exhausts the pool and makes it unusable.
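
To make the failure mode described above concrete, here is a minimal sketch. It assumes a pool that hands out permits from a semaphore and only returns them from the open callback. OpenHandlerSketch, LeakyPoolSketch and their methods are hypothetical stand-ins, not the driver's real classes, and the DNS failure is simulated with a plain exception.

```java
import java.util.concurrent.Semaphore;

// Hypothetical stand-in for the handler passed to a stream's openAsync.
interface OpenHandlerSketch {
    void completed();
    void failed(Throwable t);
}

// Hypothetical, heavily simplified pool; not the driver's DefaultConnectionPool.
class LeakyPoolSketch {
    private final Semaphore permits;

    LeakyPoolSketch(int maxSize) {
        permits = new Semaphore(maxSize);
    }

    int available() {
        return permits.availablePermits();
    }

    // Stands in for a stream's openAsync: address resolution runs outside any
    // try block, so a failure is thrown at the caller instead of reaching the handler.
    private void openAsync(boolean dnsFails, OpenHandlerSketch handler) {
        if (dnsFails) {
            throw new IllegalStateException("simulated DNS resolution failure");
        }
        handler.completed();
    }

    // Stands in for the pool's getAsync: the permit is only returned via the handler.
    void getAsync(boolean dnsFails) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("Timeout waiting for a pooled item");
        }
        openAsync(dnsFails, new OpenHandlerSketch() {
            public void completed() { permits.release(); }
            public void failed(Throwable t) { permits.release(); }
        });
    }

    // Triggers `failures` failed lookups, swallowing each exception the way an
    // outer error-handling callback would, and returns the permits still available.
    static int permitsLeftAfter(int maxSize, int failures) {
        LeakyPoolSketch pool = new LeakyPoolSketch(maxSize);
        for (int i = 0; i < failures; i++) {
            try {
                pool.getAsync(true);
            } catch (RuntimeException swallowed) {
                // The exception stops here, but nothing releases the permit.
            }
        }
        return pool.available();
    }

    public static void main(String[] args) {
        System.out.println("permits left: " + permitsLeftAfter(2, 2));
    }
}
```

In this sketch each failed lookup costs one permit, so once the number of failures reaches the pool size every later getAsync call fails with the timeout message, which matches the behaviour reported here.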

After enabling trace logs I noticed that the connection that was being opened right before ErrorHandlingResultCallback logged an error was always lost. Even after 3 hours it was never checked back into the pool or referenced in any other log message.

I believe that the openAsync methods in AsynchronousSocketChannelStream and NettyStream should capture throwables and use them to fail the AsyncCompletionHandler. I'm not sure if that could break any existing use cases. Though based on the history of these two files, the execution of serverAddress.getSocketAddresses() used to be inside a try block, but was moved out of it when JAVA-2700 added support for connecting to all IPs and part of the method was made recursive.
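
The suggestion above could look roughly like the following sketch. It is not the actual fix that was committed: CompletionSketch, GuardedOpenSketch and the Resolver hook are hypothetical names, introduced only so the resolution failure can be simulated without real DNS.

```java
import java.net.InetSocketAddress;
import java.util.List;

// Hypothetical handler type, loosely modelled on an async completion handler.
interface CompletionSketch {
    void completed();
    void failed(Throwable t);
}

class GuardedOpenSketch {
    // Hypothetical hook standing in for serverAddress.getSocketAddresses().
    interface Resolver {
        List<InetSocketAddress> resolve() throws Exception;
    }

    // Resolution happens inside the try block, so any throwable is routed
    // to the handler instead of escaping to the caller.
    static void openAsync(Resolver resolver, CompletionSketch handler) {
        List<InetSocketAddress> addresses;
        try {
            addresses = resolver.resolve();
        } catch (Throwable t) {
            handler.failed(t);
            return;
        }
        if (addresses.isEmpty()) {
            handler.failed(new IllegalStateException("no addresses resolved"));
            return;
        }
        // Connecting to each address in turn is omitted from this sketch.
        handler.completed();
    }

    // Helper that reports which handler method was invoked.
    static String outcome(Resolver resolver) {
        final StringBuilder result = new StringBuilder();
        openAsync(resolver, new CompletionSketch() {
            public void completed() { result.append("completed"); }
            public void failed(Throwable t) { result.append("failed: ").append(t.getMessage()); }
        });
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(outcome(() -> {
            throw new java.net.UnknownHostException("db.example.invalid");
        }));
    }
}
```

With this shape the caller always hears about the failure through the handler, so a pool that releases the connection in its failure callback gets the chance to do so. Whether catching Throwable rather than a narrower exception type is appropriate is a judgment call left open here.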



 Comments   
Comment by Venky Chowdary [ 21/Sep/22 ]

Thank you for the quick response Jeffrey.

We hit a similar issue in production where all the threads are stuck and
getting the error "Timeout waiting for a pooled item after" continuously
until the application was restarted. Unfortunately, we could not collect
the stack trace / heap dump before the application was restarted.

Do you have any suggestions on what kind of scenarios can hit this problem?
Or are there any other known defects in the 3.7.2 release?

Thanks,
chowdary.

Comment by Jeffrey Yemin [ 20/Sep/22 ]

malempati77@gmail.com, I don't think this particular issue is present in the 3.7 release. The regression that this fixed was introduced in the commit for JAVA-2700, which was first released in 3.9.0. So if you're seeing something similar, it's likely a different issue.

Comment by Venky Chowdary [ 19/Sep/22 ]

Is this issue applicable to the 3.7.2 driver? We are experiencing a similar issue with MongoDB 3.6.9 and MongoDB driver 3.7.2.

Comment by Githook User [ 30/Apr/20 ]

Author: John Stewart <john.stewart@mongodb.com> (jstewart-mongo)

Message: Domain name resolution issues break DefaultConnectionPool when using getAsync

JAVA-3690
Branch: master
https://github.com/mongodb/mongo-java-driver/commit/e52a27667fd51ff66935ff73f635f6022a01b6c0

Generated at Thu Feb 08 09:00:12 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.