[JAVA-3647] Race condition between closing a connection and reusing it Created: 06/Mar/20  Updated: 28/Oct/23  Resolved: 13/Mar/20

Status: Closed
Project: Java Driver
Component/s: Reactive Streams
Affects Version/s: 4.0.0
Fix Version/s: 4.0.1, 3.12.3

Type: Bug Priority: Major - P3
Reporter: Boris Petrov Assignee: Ross Lawley
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

MongoDB server 4.2.3 running on Linux.


Attachments: Zip Archive debug-traces.zip, Zip Archive trace.zip
Issue Links:
Duplicate
is duplicated by JAVA-3616 java.lang.IllegalStateException: Atte... Closed

 Description   

I'm not sure whether this problem started with version 4 of the driver. Here goes.

Until now I was using `org.mongodb:mongodb-driver-reactivestreams:1.13.0`. This was my code:

MongoClientSettings.Builder settingsBuilder = MongoClientSettings
	.builder()
	.applyToClusterSettings(
		clusterSettingsBuilder -> clusterSettingsBuilder.hosts(configuration.getAddresses()).serverSelectionTimeout(5, TimeUnit.SECONDS))
	.compressorList(Arrays.asList(MongoCompressor.createZlibCompressor(), MongoCompressor.createSnappyCompressor()))
	.applyToConnectionPoolSettings(connectionPoolSettingsBuilder -> connectionPoolSettingsBuilder.maxWaitQueueSize(20_000))
	.applyToSocketSettings(socketSettingsBuilder -> socketSettingsBuilder.connectTimeout(5, TimeUnit.SECONDS))
	.writeConcern(WriteConcern.MAJORITY.withJournal(Boolean.TRUE).withWTimeout(5, TimeUnit.SECONDS));

 As you can see, I was using the deprecated `maxWaitQueueSize`, because otherwise my code threw "Too many operations are already waiting for a connection" exceptions. Now, with that method gone in version 4, running my code again led to millions of exceptions, but this time due to a different problem. What I get every time is this:

 

Got socket exception on connection [connectionId{localValue:21, serverValue:11252}] to localhost:27017. All connections to localhost:27017 will be closed.
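As an aside on the removed setting: in the 4.x API, waiting for a connection is typically bounded through the pool's `maxSize` and `maxWaitTime` instead of `maxWaitQueueSize`. A minimal sketch of that shape, with assumed values (100 and 5 seconds are illustrative, not recommendations):

```java
// Hedged sketch: bounding connection waits in a 4.x-style settings builder,
// now that maxWaitQueueSize is gone. Values below are assumptions.
MongoClientSettings settings = MongoClientSettings
	.builder()
	.applyToConnectionPoolSettings(connectionPoolSettingsBuilder -> connectionPoolSettingsBuilder
		.maxSize(100)
		.maxWaitTime(5, TimeUnit.SECONDS))
	.build();
```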

 

I enabled driver logging and followed "localValue:21" to see what happens to it. This is the chain of events I see:

 

Checked out connection [connectionId{localValue:21}] to server localhost:27017
Asynchronously opening pooled connection connectionId{localValue:21} to server ServerId{clusterId=ClusterId{value='5e621d085a1fa329367de3a4', description='null'}, address=localhost:27017}
Pooled connection connectionId{localValue:21} to server ServerId{clusterId=ClusterId{value='5e621d085a1fa329367de3a4', description='null'}, address=localhost:27017} is not yet open
Opened connection [connectionId{localValue:21, serverValue:11252}] to localhost:27017
Pooled connection connectionId{localValue:21, serverValue:11252} to server ServerId{clusterId=ClusterId{value='5e621d085a1fa329367de3a4', description='null'}, address=localhost:27017} is now open
Sending command ...
Execution of command with request id 106 completed successfully in 1.64 ms on connection [connectionId{localValue:21, serverValue:11252}] to server localhost:27017
Checked in connection [connectionId{localValue:21, serverValue:11252}] to server localhost:27017
Closing connection connectionId{localValue:21, serverValue:11252}
Checked out connection [connectionId{localValue:21, serverValue:11252}] to server localhost:27017
Asynchronously opening pooled connection connectionId{localValue:21, serverValue:11252} to server ServerId{clusterId=ClusterId{value='5e621d085a1fa329367de3a4', description='null'}, address=localhost:27017}
Pooled connection connectionId{localValue:21, serverValue:11252} to server ServerId{clusterId=ClusterId{value='5e621d085a1fa329367de3a4', description='null'}, address=localhost:27017} is already open
WARN org.mongodb.driver.connection - Got socket exception on connection [connectionId{localValue:21, serverValue:11252}] to localhost:27017. All connections to localhost:27017 will be closed.

As you can see, closing of the connection is initiated, but then the connection is checked out to be used again. I guess that is what leads to this issue.
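The interleaving in the log chain above can be reduced to a toy model. None of this is driver code; it only illustrates, under assumed names (`ToyConnection`, `ToyPool`), why a checkout that does not re-verify openness can hand back a connection that a concurrent close already invalidated:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the check-in / close / check-out interleaving. Illustrative
// only; not the driver's connection pool.
final class ToyConnection {
    private volatile boolean open = true;
    void close() { open = false; }           // "Closing connection ..."
    boolean isOpen() { return open; }
}

final class ToyPool {
    private final Deque<ToyConnection> idle = new ArrayDeque<>();

    void checkIn(ToyConnection c) {          // "Checked in connection ..."
        idle.push(c);
    }

    // Racy checkout: returns whatever is idle without checking isOpen(),
    // so a close() that slipped in after checkIn() goes unnoticed.
    ToyConnection checkOutRacy() {           // "Checked out connection ..."
        return idle.pop();
    }
}

final class RaceSketch {
    public static void main(String[] args) {
        ToyPool pool = new ToyPool();
        ToyConnection conn = new ToyConnection();
        pool.checkIn(conn);
        conn.close();                        // e.g. pool maintenance expires it
        ToyConnection reused = pool.checkOutRacy();
        System.out.println("reused connection open? " + reused.isOpen());
    }
}
```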

It happens every single time. What I did to "fix" it was:

.applyToConnectionPoolSettings(
	connectionPoolSettingsBuilder -> connectionPoolSettingsBuilder.minSize(connectionPoolSize).maxSize(connectionPoolSize))

Setting the connections to never expire seems to fix the problem (which supports what I'm saying about a race condition somewhere).
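Written out in full, the pinned-pool workaround could also disable idle and lifetime expiry explicitly (0 means "no limit" for these settings). A sketch under those assumptions, not a recommendation; `poolSize` is an assumed application constant:

```java
// Sketch of the workaround: pin the pool size and disable expiry so the pool
// never prunes idle connections. Illustrative only.
.applyToConnectionPoolSettings(
	connectionPoolSettingsBuilder -> connectionPoolSettingsBuilder
		.minSize(poolSize)
		.maxSize(poolSize)
		.maxConnectionIdleTime(0, TimeUnit.MILLISECONDS)
		.maxConnectionLifeTime(0, TimeUnit.MILLISECONDS))
```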

 



 Comments   
Comment by Githook User [ 16/Mar/20 ]

Author:

{'username': 'rozza', 'name': 'Ross Lawley', 'email': 'ross.lawley@gmail.com'}

Message: Ensure getting the description doesn't require the pool to be open.

JAVA-3657 JAVA-3616, JAVA-3647
Branch: 3.12.x
https://github.com/mongodb/mongo-java-driver/commit/a7a3a290364bb546a23d177d61910d1107be13ec

Comment by Boris Petrov [ 13/Mar/20 ]

Great, glad to have helped! I'll be sure to try out the new version and will let you know if there are more issues. Thanks for the support!

Comment by Ross Lawley [ 13/Mar/20 ]

Hi alien,

I believe that with your help we have identified the source of the bug. Many thanks for your patience and quick responses to questions regarding this issue.

A fix will be released in the 4.0.1 version of the driver.

All the best,

Ross

Comment by Githook User [ 13/Mar/20 ]

Author:

{'username': 'rozza', 'name': 'Ross Lawley', 'email': 'ross.lawley@gmail.com'}

Message: Ensure getting the description doesn't require the pool to be open.

JAVA-3657 JAVA-3616, JAVA-3647
Branch: master
https://github.com/mongodb/mongo-java-driver/commit/cedb69c82bd4bd321a2929b4110f574dee5a2881

Comment by Githook User [ 13/Mar/20 ]

Author:

{'username': 'rozza', 'name': 'Ross Lawley', 'email': 'ross.lawley@gmail.com'}

Message: Ensure getting the description doesn't require the pool to be open.

JAVA-3657 JAVA-3616, JAVA-3647
Branch: 4.0.x
https://github.com/mongodb/mongo-java-driver/commit/41ca7ca4f42f0a13daaf7095cadcb4c764171cfa

Comment by Boris Petrov [ 11/Mar/20 ]

Ah, sorry, I didn't realize you meant the MongoDB server. Well, as far as I can see from the logs, nothing too interesting happens:

2020-03-11T09:07:56.657+0200 I  COMMAND  [conn77614] command test.objects command: createIndexes { createIndexes: "objects", indexes: ..., writeConcern: { w: "majority", wtimeout: 5000, j: true }, $db: "test" } numYields:7 reslen:114 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 9 } }, ReplicationStateTransition: { acquireCount: { w: 10 } }, Global: { acquireCount: { r: 1, w: 9 } }, Database: { acquireCount: { r: 1, w: 9 } }, Collection: { acquireCount: { r: 9, w: 1, R: 1, W: 2 }, acquireWaitCount: { r: 2, w: 1, R: 1, W: 2 }, timeAcquiringMicros: { r: 955166, w: 135211, R: 33390, W: 419305 } }, Mutex: { acquireCount: { r: 4 } } } flowControl:{ acquireCount: 8 } storage:{} protocol:op_msg 2098ms

Then a bunch of:

2020-03-11T09:08:01.806+0200 I  NETWORK  [conn77615] received client metadata from 127.0.0.1:37138 conn77615: { driver: { name: "mongo-java-driver|reactive-streams", version: "4.0.0" }, os: { type: "Linux", name: "Linux", architecture: "amd64", version: "5.5.7-arch1-1" }, platform: "Java/N/A/13.0.2+8" }

Some updates, inserts, etc. I searched for "err" but nothing comes up. Note that I'm using Java 13. It shouldn't matter, but you never know.

Keep me posted on your attempt. If you can't reproduce it, let me know and we'll see how to proceed.

Thank you too!

Comment by Jeffrey Yemin [ 11/Mar/20 ]

Hi alien

My request is for server logs, not application logs. While there may not be anything special about your application, it's possible that your MongoDB deployment differs in a way that makes the issue difficult to reproduce.

We often ask for reproducers even when it seems like there is nothing special about the application. For example, I don't know how many concurrent requests the application executes. That said, I do plan to try building my own reproducer.

Thanks for your continued patience.

Comment by Boris Petrov [ 11/Mar/20 ]

I've sent you the full logs; if there were any application exceptions, they would have been in them. So unfortunately there is nothing more I can give you.

As for reproduction: I haven't tried. But I don't think our application is special in any way. Have you tried running a simple app that just reads from and writes to a Mongo collection a million times a second for a few minutes? I'm pretty sure it will break in the same way. If you're having trouble reproducing the problem even with such an example app, I could try to write one, but that will take a few days at least.

Thanks!

Comment by Jeffrey Yemin [ 11/Mar/20 ]

Hi alien

Thank you for the traces. I see that the first error is a java.nio.channels.AsynchronousCloseException at Mar 11 07:08:02 UTC 2020. Can you locate any messages in the server logs that correspond to this exception?

Also, do you have a reproducer that you could attach or describe? Or can you only reproduce this in an actual application that you're running?

Comment by Boris Petrov [ 11/Mar/20 ]

debug-traces.zip

I've attached two traces. There are so many different errors happening that I'm not sure they will help you.

Just to let you know that we've downgraded to 1.13.0 due to these problems.

Thanks for the support!

Comment by Jeffrey Yemin [ 10/Mar/20 ]

JAVA-3657 looks to be the root cause of the stack trace in trace.zip:

    java.lang.IllegalStateException: state should be: open
        at com.mongodb.assertions.Assertions.isTrue(Assertions.java:70)
        at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.getDescription(DefaultConnectionPool.java:563)
       ...

Would it be possible to re-run your tests with only debug logging enabled, rather than trace?
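The "state should be: open" message in the trace above comes from an assertion of this shape. The sketch below is an illustrative mirror, with assumed names (`StateAssertions`, `DescriptionSketch`); it follows the pattern of `com.mongodb.assertions.Assertions.isTrue` but is not driver code:

```java
// Illustrative mirror of the check that produces "state should be: open".
final class StateAssertions {
    static void isTrue(String name, boolean condition) {
        if (!condition) {
            throw new IllegalStateException("state should be: " + name);
        }
    }
}

final class DescriptionSketch {
    public static void main(String[] args) {
        boolean connectionOpen = false; // the pool already closed the connection
        try {
            // A getDescription() guarded this way fails once close() has raced in.
            StateAssertions.isTrue("open", connectionOpen);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This is why the fix commits ("Ensure getting the description doesn't require the pool to be open") target exactly this guard.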

Comment by Boris Petrov [ 08/Mar/20 ]

We got this exception even with the "fix" I specified above:

ERROR Callback onResult call produced an error
       java.lang.IllegalStateException: Attempted to decrement the reference count below 0
       	at com.mongodb.internal.binding.AbstractReferenceCounted.release(AbstractReferenceCounted.java:39)
       	at com.mongodb.internal.binding.AsyncClusterBinding$AsyncClusterBindingConnectionSource.release(AsyncClusterBinding.java:135)
       	at com.mongodb.internal.operation.OperationHelper$6.onResult(OperationHelper.java:595)
       	at com.mongodb.internal.operation.OperationHelper$6.onResult(OperationHelper.java:592)
       	at com.mongodb.internal.connection.DefaultServer$1.onResult(DefaultServer.java:112)
       	at com.mongodb.internal.connection.DefaultServer$1.onResult(DefaultServer.java:103)
       	at com.mongodb.internal.async.ErrorHandlingResultCallback.onResult(ErrorHandlingResultCallback.java:48)
       	at com.mongodb.internal.connection.DefaultConnectionPool$2.onResult(DefaultConnectionPool.java:217)
       	at com.mongodb.internal.connection.DefaultConnectionPool$2.onResult(DefaultConnectionPool.java:207)
       	at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection$1.onResult(DefaultConnectionPool.java:448)
       	at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection$1.onResult(DefaultConnectionPool.java:442)
       	at com.mongodb.internal.connection.UsageTrackingInternalConnection$1.onResult(UsageTrackingInternalConnection.java:61)
       	at com.mongodb.internal.connection.UsageTrackingInternalConnection$1.onResult(UsageTrackingInternalConnection.java:57)
       	at com.mongodb.internal.connection.InternalStreamConnection$1$1.onResult(InternalStreamConnection.java:159)
       	at com.mongodb.internal.connection.InternalStreamConnection$1$1.onResult(InternalStreamConnection.java:154)
       	at com.mongodb.internal.connection.InternalStreamConnectionInitializer$1$1.onResult(InternalStreamConnectionInitializer.java:85)
       	at com.mongodb.internal.connection.InternalStreamConnectionInitializer$1$1.onResult(InternalStreamConnectionInitializer.java:81)
       	at com.mongodb.internal.connection.SaslAuthenticator$Continuator.onResult(SaslAuthenticator.java:230)
       	at com.mongodb.internal.connection.SaslAuthenticator$Continuator.onResult(SaslAuthenticator.java:213)
       	at com.mongodb.internal.connection.CommandHelper$1.onResult(CommandHelper.java:57)
       	at com.mongodb.internal.connection.CommandHelper$1.onResult(CommandHelper.java:53)
       	at com.mongodb.internal.connection.InternalStreamConnection$2$1.onResult(InternalStreamConnection.java:400)
       	at com.mongodb.internal.connection.InternalStreamConnection$2$1.onResult(InternalStreamConnection.java:375)
       	at com.mongodb.internal.connection.InternalStreamConnection$MessageHeaderCallback$MessageCallback.onResult(InternalStreamConnection.java:676)
       	at com.mongodb.internal.connection.InternalStreamConnection$MessageHeaderCallback$MessageCallback.onResult(InternalStreamConnection.java:643)
       	at com.mongodb.internal.connection.InternalStreamConnection$5.completed(InternalStreamConnection.java:513)
       	at com.mongodb.internal.connection.InternalStreamConnection$5.completed(InternalStreamConnection.java:510)
       	at com.mongodb.internal.connection.AsynchronousChannelStream$BasicCompletionHandler.completed(AsynchronousChannelStream.java:220)
       	at com.mongodb.internal.connection.AsynchronousChannelStream$BasicCompletionHandler.completed(AsynchronousChannelStream.java:203)
       	at java.base/sun.nio.ch.Invoker.invokeUnchecked(Invoker.java:127)
       	at java.base/sun.nio.ch.Invoker.invokeDirect(Invoker.java:158)
       	at java.base/sun.nio.ch.UnixAsynchronousSocketChannelImpl.implRead(UnixAsynchronousSocketChannelImpl.java:568)
       	at java.base/sun.nio.ch.AsynchronousSocketChannelImpl.read(AsynchronousSocketChannelImpl.java:276)
       	at java.base/sun.nio.ch.AsynchronousSocketChannelImpl.read(AsynchronousSocketChannelImpl.java:297)
       	at com.mongodb.internal.connection.AsynchronousSocketChannelStream$AsynchronousSocketChannelAdapter.read(AsynchronousSocketChannelStream.java:136)
       	at com.mongodb.internal.connection.AsynchronousChannelStream.readAsync(AsynchronousChannelStream.java:105)
       	at com.mongodb.internal.connection.InternalStreamConnection.readAsync(InternalStreamConnection.java:510)
       	at com.mongodb.internal.connection.InternalStreamConnection.access$1000(InternalStreamConnection.java:75)
       	at com.mongodb.internal.connection.InternalStreamConnection$MessageHeaderCallback.onResult(InternalStreamConnection.java:633)
       	at com.mongodb.internal.connection.InternalStreamConnection$MessageHeaderCallback.onResult(InternalStreamConnection.java:618)
       	at com.mongodb.internal.connection.InternalStreamConnection$5.completed(InternalStreamConnection.java:513)
       	at com.mongodb.internal.connection.InternalStreamConnection$5.completed(InternalStreamConnection.java:510)
       	at com.mongodb.internal.connection.AsynchronousChannelStream$BasicCompletionHandler.completed(AsynchronousChannelStream.java:220)
       	at com.mongodb.internal.connection.AsynchronousChannelStream$BasicCompletionHandler.completed(AsynchronousChannelStream.java:203)
       	at java.base/sun.nio.ch.Invoker.invokeUnchecked(Invoker.java:127)
       	at java.base/sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishRead(UnixAsynchronousSocketChannelImpl.java:443)
       	at java.base/sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:193)
       	at java.base/sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:215)
       	at java.base/sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:306)
       	at java.base/sun.nio.ch.AsynchronousChannelGroupImpl$1.run(AsynchronousChannelGroupImpl.java:112)
       	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       	at java.base/java.lang.Thread.run(Thread.java:830)

It only appeared after a number of runs of our tests, though, so it's very rare. I guess this is again due to a race condition somewhere. I'm not sure whether it's related to the original issue.

P.S. I saw a number of these in the traces from the original issue, so it's probably related.
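The failure mode at the top of that trace reduces to a toy reference counter: a release that races with another release drives the count below zero. An illustrative sketch under an assumed name (`ToyReferenceCounted`), not the driver's `AbstractReferenceCounted`:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy reference counter mirroring the double-release failure mode.
// Illustrative only; not driver code.
final class ToyReferenceCounted {
    private final AtomicInteger referenceCount = new AtomicInteger(1);

    void retain() {
        referenceCount.incrementAndGet();
    }

    void release() {
        // A second, racing release decrements 0 to -1 and trips this check.
        if (referenceCount.decrementAndGet() < 0) {
            throw new IllegalStateException(
                    "Attempted to decrement the reference count below 0");
        }
    }

    int count() {
        return referenceCount.get();
    }
}
```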

Comment by Boris Petrov [ 07/Mar/20 ]

Thanks for the support! I've attached a trace file. Be sure to `less -R` it (after extracting, of course). I'm not sure these will help you pinpoint the problem, but I hope they can convince you that something is amiss. Please tell me if I can do anything more!

Comment by Jeffrey Yemin [ 06/Mar/20 ]

Thanks for reporting this. With the information you've provided, I can't fully reconstruct the chain of events. I gather you don't see any messages like the following?

Closed connection [connectionId{localValue:21, serverValue:11252}] to server localhost:27017  because ...

What other log messages (presumably for other connections) do you see between these two:

Checked in connection [connectionId{localValue:21, serverValue:11252}] to server localhost:27017
Closing connection connectionId{localValue:21, serverValue:11252}

 

If you could provide the full logs for the period between the first and last log messages you quoted (including all connection activity), that would be helpful.

Generated at Thu Feb 08 09:00:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.