[JAVA-4974] try another srv-record on MongoSocketException Created: 24/May/23  Updated: 27/Oct/23  Resolved: 31/May/23

Status: Closed
Project: Java Driver
Component/s: Retryability
Affects Version/s: 4.0.5
Fix Version/s: None

Type: Improvement Priority: Unknown
Reporter: Nikita Sokolov Assignee: Valentin Kavalenka
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to JAVA-5119 failed retries exception should show ... Closed
Documentation Changes Summary:

1. What would you like to communicate to the user about this feature?
2. Would you like the user to see examples of the syntax and/or executable code and its output?
3. Which versions of the driver/connector does this apply to?


 Description   

com.mongodb.internal.operation.OperationHelper#withReadConnectionSource could try getting another server until there are none left when opening connection fails, so the users would not get the MongoSocketException in case of mongo/mongos instances being moved

com.mongodb.MongoSocketException: 10-232-23-30.backend-mongos-production-aux.backend-mongos.svc.hoffman2.local
    at com.mongodb.ServerAddress.getSocketAddresses(ServerAddress.java:211)
    at com.mongodb.internal.connection.SocketStream.initializeSocket(SocketStream.java:75)
    at com.mongodb.internal.connection.SocketStream.open(SocketStream.java:65)
    at com.mongodb.internal.connection.InternalStreamConnection.open(InternalStreamConnection.java:127)
    at com.mongodb.internal.connection.UsageTrackingInternalConnection.open(UsageTrackingInternalConnection.java:50)
    at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.open(DefaultConnectionPool.java:435)
    at com.mongodb.internal.connection.DefaultConnectionPool.get(DefaultConnectionPool.java:117)
    at com.mongodb.internal.connection.DefaultConnectionPool.get(DefaultConnectionPool.java:102)
    at com.mongodb.internal.connection.DefaultServer.getConnection(DefaultServer.java:90)
    at com.mongodb.internal.binding.ClusterBinding$ClusterBindingConnectionSource.getConnection(ClusterBinding.java:119)
    at com.mongodb.client.internal.ClientSessionBinding$SessionBindingConnectionSource.getConnection(ClientSessionBinding.java:135)
    at com.mongodb.internal.operation.FindOperation$1.call(FindOperation.java:628)
    at com.mongodb.internal.operation.FindOperation$1.call(FindOperation.java:625)
    at com.mongodb.internal.operation.OperationHelper.withReadConnectionSource(OperationHelper.java:462)
    at com.mongodb.internal.operation.FindOperation.execute(FindOperation.java:625)
    at com.mongodb.internal.operation.FindOperation.execute(FindOperation.java:77)
    at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:190)
    at com.mongodb.client.internal.FindIterableImpl.first(FindIterableImpl.java:189)



 Comments   
Comment by Valentin Kavalenka [ 12/Jul/23 ]

Roughly speaking, yes. More precisely, if the driver retries an operation, then it goes through selecting a server for the operation again. If the driver does not consider a server available, it will not select it for an operation.

Note that https://jira.mongodb.org/ is intended to be a place for reporting bugs and requesting features. If you have more questions, consider using https://www.mongodb.com/community/forums/.

Comment by Nikita Sokolov [ 12/Jul/23 ]

But after failing to lookup the 10-232-86-5.backend-mongos-production-user-critical.backend-mongos.svc.hoffman.local, one of the hosts which the mongodb+srv:// address was resolved to, another host would be tried, right?

Comment by Valentin Kavalenka [ 11/Jul/23 ]

Would not there be two different servers in the exception if that was the case?

No, the driver does not combine the exception from the second failed attempt with the exception from the first failed attempt. However, I am not sure I correctly interpret what you mean here.

The MongoConnectionPoolClearedException you posted means that there was a thread that caught a MongoSocketException (caused by UnknownHostException) when it was looking up 10-232-86-5.backend-mongos-production-user-critical.backend-mongos.svc.hoffman.local. As a result, that thread marked the corresponding server in the driver's view of the cluster as not available, and paused its connection pool. Concurrently, a different thread was able to select that same server for an operation, but when it tried to checkout a connection from its pool, the pool reported that it was paused (hence the MongoConnectionPoolClearedException), and that the cause of it being paused is the MongoSocketException.

Also, this would make the exceptions rare, but on our side they are as frequent as they were.

If you suspect the operations are not retried, you can


1 "Retrying ..." messages may be duplicated, that's a known fixed bug.

Comment by Nikita Sokolov [ 11/Jul/23 ]

Would not there be two different servers in the exception if that was the case? Also, this would make the exceptions rare, but on our side they are as frequent as they were.

Comment by Valentin Kavalenka [ 11/Jul/23 ]

Hi faucct@joom.com,
Due to the driver retrying a bounded number of times (currently, once), it is still possible to get such exceptions, as I mentioned. I also suggested there what can be done when downscaling to avoid having the exceptions. However, even that approach is inherently racy and does not formally prevent such exceptions from happening.

Comment by Nikita Sokolov [ 10/Jul/23 ]

We have updated the mongo driver to 4.7.2 and still getting the exception:

org.springframework.data.mongodb.UncategorizedMongoDbException: Connection pool for ServerId{clusterId=ClusterId{value='64ac114b6b363d7ed6128fc7', description='joom-ranker'}, address=10-232-86-5.backend-mongos-production-user-critical.backend-mongos.svc.hoffman.local:27017} is paused because another operation failed; nested exception is com.mongodb.MongoConnectionPoolClearedException: Connection pool for ServerId{clusterId=ClusterId{value='64ac114b6b363d7ed6128fc7', description='joom-ranker'}, address=10-232-86-5.backend-mongos-production-user-critical.backend-mongos.svc.hoffman.local:27017} is paused because another operation failed
    at org.springframework.data.mongodb.core.MongoExceptionTranslator.translateExceptionIfPossible(MongoExceptionTranslator.java:133)
    at org.springframework.data.mongodb.core.ReactiveMongoTemplate.potentiallyConvertRuntimeException(ReactiveMongoTemplate.java:2814)
    at org.springframework.data.mongodb.core.ReactiveMongoTemplate.lambda$translateException$90(ReactiveMongoTemplate.java:2797)
    at reactor.core.publisher.Mono.lambda$onErrorMap$30(Mono.java:3325)
    at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:88)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.secondError(MonoFlatMap.java:185)
    at reactor.core.publisher.MonoFlatMap$FlatMapInner.onError(MonoFlatMap.java:251)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onError(MonoFlatMap.java:165)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onError(MonoFlatMap.java:165)
    at reactor.core.publisher.FluxMap$MapSubscriber.onError(FluxMap.java:126)
    at reactor.core.publisher.MonoNext$NextSubscriber.onError(MonoNext.java:87)
    at reactor.core.publisher.MonoNext$NextSubscriber.onError(MonoNext.java:87)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.secondError(MonoFlatMap.java:185)
    at reactor.core.publisher.MonoFlatMap$FlatMapInner.onError(MonoFlatMap.java:251)
    at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.onError(MonoPeekTerminal.java:251)
    at reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:190)
    at com.mongodb.reactivestreams.client.internal.MongoOperationPublisher.lambda$sinkToCallback$31(MongoOperationPublisher.java:570)
    at com.mongodb.reactivestreams.client.internal.OperationExecutorImpl.lambda$execute$2(OperationExecutorImpl.java:92)
    at com.mongodb.internal.async.ErrorHandlingResultCallback.onResult(ErrorHandlingResultCallback.java:48)
    at com.mongodb.internal.async.function.AsyncCallbackSupplier.lambda$whenComplete$1(AsyncCallbackSupplier.java:95)
    at com.mongodb.internal.async.function.RetryingAsyncCallbackSupplier$RetryingCallback.onResult(RetryingAsyncCallbackSupplier.java:109)
    at com.mongodb.internal.async.ErrorHandlingResultCallback.onResult(ErrorHandlingResultCallback.java:48)
    at com.mongodb.internal.async.function.AsyncCallbackSupplier.lambda$whenComplete$1(AsyncCallbackSupplier.java:95)
    at com.mongodb.internal.async.ErrorHandlingResultCallback.onResult(ErrorHandlingResultCallback.java:48)
    at com.mongodb.internal.operation.OperationHelper.lambda$withAsyncSuppliedResource$7(OperationHelper.java:640)
    at com.mongodb.internal.connection.DefaultServer$1.onResult(DefaultServer.java:130)
    at com.mongodb.internal.connection.DefaultServer$1.onResult(DefaultServer.java:122)
    at com.mongodb.internal.async.ErrorHandlingResultCallback.onResult(ErrorHandlingResultCallback.java:48)
    at com.mongodb.internal.connection.DefaultConnectionPool.lambda$getAsync$0(DefaultConnectionPool.java:199)
    at com.mongodb.internal.connection.DefaultConnectionPool.lambda$getAsync$1(DefaultConnectionPool.java:216)
    at com.mongodb.internal.connection.DefaultConnectionPool$Task.doComplete(DefaultConnectionPool.java:1382)
    at com.mongodb.internal.connection.DefaultConnectionPool$Task.execute(DefaultConnectionPool.java:1368)
    at com.mongodb.internal.connection.DefaultConnectionPool$AsyncWorkManager.workerRun(DefaultConnectionPool.java:1309)
    at com.mongodb.internal.connection.DefaultConnectionPool$AsyncWorkManager.runAndLogUncaught(DefaultConnectionPool.java:1339)
    at com.mongodb.internal.connection.DefaultConnectionPool$AsyncWorkManager.lambda$initUnlessClosed$0(DefaultConnectionPool.java:1274)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.mongodb.MongoConnectionPoolClearedException: Connection pool for ServerId{clusterId=ClusterId{value='64ac114b6b363d7ed6128fc7', description='joom-ranker'}, address=10-232-86-5.backend-mongos-production-user-critical.backend-mongos.svc.hoffman.local:27017} is paused because another operation failed
    at com.mongodb.internal.connection.DefaultConnectionPool$StateAndGeneration.lambda$pauseAndIncrementGeneration$0(DefaultConnectionPool.java:1491)
    at com.mongodb.internal.connection.ConcurrentPool$StateAndPermits.throwIfClosedOrPaused(ConcurrentPool.java:498)
    at com.mongodb.internal.connection.ConcurrentPool$StateAndPermits.acquirePermit(ConcurrentPool.java:401)
    at com.mongodb.internal.connection.ConcurrentPool.get(ConcurrentPool.java:152)
    at com.mongodb.internal.connection.DefaultConnectionPool.getPooledConnection(DefaultConnectionPool.java:304)
    at com.mongodb.internal.connection.DefaultConnectionPool.lambda$getAsync$1(DefaultConnectionPool.java:214)
    ... 10 common frames omitted
Caused by: com.mongodb.MongoSocketException: 10-232-86-5.backend-mongos-production-user-critical.backend-mongos.svc.hoffman.local: Name or service not known
    at com.mongodb.ServerAddress.getSocketAddresses(ServerAddress.java:217)
    at com.mongodb.internal.connection.AsynchronousSocketChannelStream.openAsync(AsynchronousSocketChannelStream.java:60)
    at com.mongodb.internal.connection.InternalStreamConnection.openAsync(InternalStreamConnection.java:202)
    at com.mongodb.internal.connection.UsageTrackingInternalConnection.openAsync(UsageTrackingInternalConnection.java:64)
    at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.openAsync(DefaultConnectionPool.java:546)
    at com.mongodb.internal.connection.DefaultConnectionPool$OpenConcurrencyLimiter.openAsyncWithConcurrencyLimit(DefaultConnectionPool.java:950)
    at com.mongodb.internal.connection.DefaultConnectionPool.lambda$getAsync$1(DefaultConnectionPool.java:230)
    ... 10 common frames omitted
Caused by: java.net.UnknownHostException: 10-232-86-5.backend-mongos-production-user-critical.backend-mongos.svc.hoffman.local: Name or service not known
    at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
    at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)
    at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1529)
    at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)
    at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1519)
    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1378)
    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1306)
    at com.mongodb.ServerAddress.getSocketAddresses(ServerAddress.java:209)
    ... 16 common frames omitted 

Comment by Valentin Kavalenka [ 31/May/23 ]

faucct@joom.com I noticed there was a bug https://jira.mongodb.org/browse/JAVA-4684 in the retry logic introduced in 4.4, which was fixed only in 4.7. Therefore, if you decide to upgrade, and you can't upgrade to the latest version, then at the very least upgrade to 4.7.2.

Comment by Valentin Kavalenka [ 31/May/23 ]

Hi faucct@joom.com

com.mongodb.internal.operation.OperationHelper#withReadConnectionSource could try getting another server until there are none left when opening connection fails, so the users would not get the MongoSocketException in case of mongo/mongos instances being moved

As of 4.4 (see https://jira.mongodb.org/browse/JAVA-4034), retryable reads and writes do re-select the server during the second attempt. However, since currently the driver retries only once, it is still possible for it to select another server that is gone as a result of downscaling you mentioned, if the driver has not yet detected that the server is gone. If that happens, you still may observe the error you reported even if you upgrade the driver to 4.4.2 (given how big this jump is for you, you might as well try upgrading to the latest version, which is 4.9.1, provided that it's compatible with your MongoDB server version).

Alternatively to upgrading and/or in addition to that, you may try removing those of your DNS SRV records that point to hosts that will be removed as a result of downscaling, before you do the downscaling. If the time interval between removing the SRV records pointing to soon-to-be-gone hosts and the downscaling is larger than 60 seconds (the driver's SRV records monitoring interval) + SRV record TTL, then the driver should see the corresponding seed hosts gone, and remove them from its cluster view, which should eliminate the errors you reported. However, if some hosts removed by downscaling were discovered by the driver not from the SRV records, then the driver still may select them for an operation and fail with the error you reported until it realizes that they are no longer available / part of the cluster.

Comment by Nikita Sokolov [ 24/May/23 ]

Yes, we are using the mongodb+srv scheme.
This is not a blocker, but we occasionally get a spike of those errors and those could be avoided. I guess this happens when the kubernetes mongos deployment is being downscaled.

Comment by Jeffrey Yemin [ 24/May/23 ]

Thanks. And just to confirm: you're using a mongodb+srv connection string?

Also, is this a blocker for you? Do you expect DNS failures like this in production environment? If so, can you elaborate on the scenario? Why would a subset of mongos servers not be available in DNS lookup like this?

Comment by Nikita Sokolov [ 24/May/23 ]

I have added the UnknownHostException. The rest is the spring-data up the stack.

Comment by Jeffrey Yemin [ 24/May/23 ]

Hi faucct@joom.com

There should be a nested IOException available. Can you post that as well?

Comment by Nikita Sokolov [ 24/May/23 ]

Caused by: java.net.UnknownHostException: 10-232-23-30.backend-mongos-production-aux.backend-mongos.svc.hoffman2.local
    at java.base/java.net.InetAddress$CachedAddresses.get(InetAddress.java:797)
    at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1519)
    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1378)
    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1306)
    at com.mongodb.ServerAddress.getSocketAddresses(ServerAddress.java:203)

Comment by Nikita Sokolov [ 24/May/23 ]

The driver version is 4.0.5.

Comment by PM Bot [ 24/May/23 ]

Hi faucct@joom.com, thank you for reporting this issue! The team will look into it and get back to you soon.

Generated at Thu Feb 08 09:03:25 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.