[JAVA-4122] Getting Exceptions when CPU starving the driver Created: 28/Apr/21  Updated: 27/Oct/23  Resolved: 24/May/21

Status: Closed
Project: Java Driver
Component/s: Scala
Affects Version/s: 4.2.3
Fix Version/s: None

Type: Bug Priority: Unknown
Reporter: Jork Zijlstra Assignee: Jeffrey Yemin
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screenshot 2021-04-30 at 16.45.53.png     PNG File api-load-04-23 (2).png     PNG File api-network-overview (2).png     JPEG File cpu-load.jpg     Zip Archive graylog-search-result-absolute-2021-05-03T07_25_08.000Z-2021-05-03T07_46_08.000Z (1) (1).zip     File graylog-search-result-absolute-2021-05-03T07_25_08.000Z-2021-05-03T07_46_08.000Z (1).csv     PNG File image-2021-05-03-15-26-01-507.png    
Issue Links:
Related
is related to JAVA-4210 MongoDb returns message length larger... Closed
Case:

 Description   

Using the Scala driver (4.2.3), we started noticing some instability in our application after switching over to an Atlas cluster.

We were, for instance, getting

 com.mongodb.MongoInternalException: Unexpected runtime exception
	at com.mongodb.internal.connection.InternalStreamConnection.translateReadException(InternalStreamConnection.java:638)
	at com.mongodb.internal.connection.InternalStreamConnection.readAsync(InternalStreamConnection.java:591)
	at com.mongodb.internal.connection.InternalStreamConnection.access$1100(InternalStreamConnection.java:78)
	at com.mongodb.internal.connection.InternalStreamConnection$MessageHeaderCallback.onResult(InternalStreamConnection.java:701)
	at com.mongodb.internal.connection.InternalStreamConnection$MessageHeaderCallback.onResult(InternalStreamConnection.java:686)
	at com.mongodb.internal.connection.InternalStreamConnection$5.completed(InternalStreamConnection.java:581)
	at com.mongodb.internal.connection.InternalStreamConnection$5.completed(InternalStreamConnection.java:578)
	at com.mongodb.internal.connection.AsynchronousChannelStream$BasicCompletionHandler.completed(AsynchronousChannelStream.java:250)
	at com.mongodb.internal.connection.AsynchronousChannelStream$BasicCompletionHandler.completed(AsynchronousChannelStream.java:233)
	at com.mongodb.internal.connection.tlschannel.async.AsynchronousTlsChannel.lambda$read$4(AsynchronousTlsChannel.java:122)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
	at java.base/java.util.concurrent.AbstractExecutorService.submit(Unknown Source)
	at com.mongodb.internal.connection.tlschannel.async.AsynchronousTlsChannel.lambda$read$5(AsynchronousTlsChannel.java:122)
	at com.mongodb.internal.connection.tlschannel.async.AsynchronousTlsChannelGroup.doRead(AsynchronousTlsChannelGroup.java:569)
	at com.mongodb.internal.connection.tlschannel.async.AsynchronousTlsChannelGroup.lambda$processRead$5(AsynchronousTlsChannelGroup.java:475)
	at java.base/java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
	at com.mongodb.internal.connection.tlschannel.async.AsynchronousTlsChannelGroup.processRead(AsynchronousTlsChannelGroup.java:472)
	at com.mongodb.internal.connection.tlschannel.async.AsynchronousTlsChannelGroup.loop(AsynchronousTlsChannelGroup.java:407)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalArgumentException: capacity < 0: (-16 < 0)
	at java.base/java.nio.Buffer.createCapacityException(Unknown Source)
	at java.base/java.nio.ByteBuffer.allocate(Unknown Source)
	at com.mongodb.internal.connection.PowerOfTwoBufferPool.createNew(PowerOfTwoBufferPool.java:90)
	at com.mongodb.internal.connection.PowerOfTwoBufferPool.getByteBuffer(PowerOfTwoBufferPool.java:82)
	at com.mongodb.internal.connection.PowerOfTwoBufferPool.getBuffer(PowerOfTwoBufferPool.java:77)
	at com.mongodb.internal.connection.AsynchronousChannelStream.readAsync(AsynchronousChannelStream.java:111)
	at com.mongodb.internal.connection.AsynchronousChannelStream.readAsync(AsynchronousChannelStream.java:107)
	at com.mongodb.internal.connection.InternalStreamConnection.readAsync(InternalStreamConnection.java:578)
	... 23 common frames omitted

and

com.mongodb.MongoInternalException: The reply message length 680237824 is greater than the maximum message length 48000000
	at com.mongodb.internal.connection.MessageHeader.<init>(MessageHeader.java:41)
	at com.mongodb.internal.connection.InternalStreamConnection$MessageHeaderCallback.onResult(InternalStreamConnection.java:700)
	at com.mongodb.internal.connection.InternalStreamConnection$MessageHeaderCallback.onResult(InternalStreamConnection.java:686)
	at com.mongodb.internal.connection.InternalStreamConnection$5.completed(InternalStreamConnection.java:581)
	at com.mongodb.internal.connection.InternalStreamConnection$5.completed(InternalStreamConnection.java:578)
	at com.mongodb.internal.connection.AsynchronousChannelStream$BasicCompletionHandler.completed(AsynchronousChannelStream.java:250)
	at com.mongodb.internal.connection.AsynchronousChannelStream$BasicCompletionHandler.completed(AsynchronousChannelStream.java:233)
	at com.mongodb.internal.connection.tlschannel.async.AsynchronousTlsChannel.lambda$read$4(AsynchronousTlsChannel.java:122)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
	at java.base/java.util.concurrent.AbstractExecutorService.submit(Unknown Source)
	at com.mongodb.internal.connection.tlschannel.async.AsynchronousTlsChannel.lambda$read$5(AsynchronousTlsChannel.java:122)
	at com.mongodb.internal.connection.tlschannel.async.AsynchronousTlsChannelGroup.doRead(AsynchronousTlsChannelGroup.java:569)
	at com.mongodb.internal.connection.tlschannel.async.AsynchronousTlsChannelGroup.lambda$processRead$5(AsynchronousTlsChannelGroup.java:475)
	at java.base/java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
	at com.mongodb.internal.connection.tlschannel.async.AsynchronousTlsChannelGroup.processRead(AsynchronousTlsChannelGroup.java:472)
	at com.mongodb.internal.connection.tlschannel.async.AsynchronousTlsChannelGroup.loop(AsynchronousTlsChannelGroup.java:407)
	at java.base/java.lang.Thread.run(Unknown Source) 

It seems that the driver throws these exceptions when it is starved for CPU. We run our application on K8s with specific resource limits set. After some load testing, the exceptions seem to disappear once we double the CPU resources to be able to handle 50% of the same traffic.
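
For reference, both stack traces fail while reading the reply's 16-byte wire-protocol message header or sizing the body from it (the header is four little-endian int32s: messageLength, requestID, responseTo, opCode). The snippet below is not the driver's code, just a minimal sketch of the arithmetic, assuming the read position on the stream has drifted: a messageLength read as 0 leaves 0 - 16 = -16 bytes to allocate for the body, and arbitrary payload bytes interpreted as a length can easily exceed the 48000000-byte maximum.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class HeaderMisreadSketch {
    // MongoDB wire-protocol message header: 4 little-endian int32s = 16 bytes.
    static final int MESSAGE_HEADER_LENGTH = 16;
    static final int MAX_MESSAGE_LENGTH = 48_000_000; // maximum quoted in the second exception

    public static void main(String[] args) {
        // Hypothetical header bytes read after the stream position has drifted:
        // here simply all zeros, as if the socket handed back the wrong 16 bytes.
        ByteBuffer header = ByteBuffer.allocate(MESSAGE_HEADER_LENGTH).order(ByteOrder.LITTLE_ENDIAN);

        int messageLength = header.getInt(0);               // 0 instead of the real length
        int bodyToAllocate = messageLength - MESSAGE_HEADER_LENGTH;
        System.out.println("body to allocate: " + bodyToAllocate); // -16, matching "capacity < 0: (-16 < 0)"

        // If the four "length" bytes are really payload data, the value is effectively random
        // and can exceed the maximum, matching the second exception.
        header.putInt(0, 680_237_824);                      // value from the second stack trace
        int bogusLength = header.getInt(0);
        if (bogusLength > MAX_MESSAGE_LENGTH) {
            System.out.println("reply message length " + bogusLength
                    + " is greater than the maximum message length " + MAX_MESSAGE_LENGTH);
        }
    }
}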

Could it be that something is not thread safe?

Why we need at least double the CPU resources, I don't know yet.



 Comments   
Comment by Backlog - Core Eng Program Management Team [ 24/May/21 ]

There hasn't been any recent activity on this ticket, so we're resolving it. Thanks for reaching out! Please feel free to comment on this if you're able to provide more information.

Comment by Jeffrey Yemin [ 08/May/21 ]

I had a closer look at the CSV file containing the application logs, and noticed that there are many OutOfMemoryError exceptions being thrown as well. While this doesn't explain the other errors, it may be a contributing factor. First example:

2021-05-03T07:34:05.996Z, "Error during processing of request: 'Boxed Error'. Completing with 500 Internal Server Error response. To change default exception handling behavior, provide a custom ExceptionHandler.", java.util.concurrent.ExecutionException: Boxed Error
at scala.concurrent.impl.Promise$.resolver(Promise.scala:83)
at scala.concurrent.impl.Promise$.scala$concurrent$impl$Promise$$resolveTry(Promise.scala:75)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:280)
at scala.concurrent.Promise.complete(Promise.scala:49)
at scala.concurrent.Promise.complete$(Promise.scala:48)
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:183)
at scala.concurrent.Promise.failure(Promise.scala:100)
at scala.concurrent.Promise.failure$(Promise.scala:100)
at scala.concurrent.impl.Promise$DefaultPromise.failure(Promise.scala:183)
at org.mongodb.scala.Observable$$anon$2.$anonfun$onError$1(Observable.scala:399)
at org.mongodb.scala.Observable$$anon$2.completeWith(Observable.scala:419)
at org.mongodb.scala.Observable$$anon$2.onError(Observable.scala:398)
at org.mongodb.scala.ObservableImplicits$ToSingleObservablePublisher$$anon$1.$anonfun$onError$1(ObservableImplicits.scala:102)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at org.mongodb.scala.ObservableImplicits$ToSingleObservablePublisher$$anon$1.completeWith(ObservableImplicits.scala:121)
at org.mongodb.scala.ObservableImplicits$ToSingleObservablePublisher$$anon$1.onError(ObservableImplicits.scala:102)
at org.mongodb.scala.internal.SubscriptionCheckingObserver.onError(SubscriptionCheckingObserver.scala:38)
at reactor.core.publisher.StrictSubscriber.onError(StrictSubscriber.java:106)
at reactor.core.publisher.MonoFlatMap$FlatMapMain.onError(MonoFlatMap.java:165)
at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onError(FluxMapFuseable.java:134)
at reactor.core.publisher.MonoFlatMap$FlatMapMain.secondError(MonoFlatMap.java:185)
at reactor.core.publisher.MonoFlatMap$FlatMapInner.onError(MonoFlatMap.java:251)
at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:214)
at reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:185)
at com.mongodb.reactivestreams.client.internal.MongoOperationPublisher.lambda$sinkToCallback$30(MongoOperationPublisher.java:546)
at com.mongodb.reactivestreams.client.internal.OperationExecutorImpl.lambda$execute$2(OperationExecutorImpl.java:73)
at com.mongodb.internal.async.ErrorHandlingResultCallback.onResult(ErrorHandlingResultCallback.java:48)
at com.mongodb.internal.operation.FindOperation$3.onResult(FindOperation.java:752)
at com.mongodb.internal.operation.OperationHelper$ReferenceCountedReleasingWrappedCallback.onResult(OperationHelper.java:532)
at com.mongodb.internal.operation.CommandOperationHelper$10.checkRetryableException(CommandOperationHelper.java:495)
at com.mongodb.internal.operation.CommandOperationHelper$10.onResult(CommandOperationHelper.java:480)
at com.mongodb.internal.async.ErrorHandlingResultCallback.onResult(ErrorHandlingResultCallback.java:48)
at com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor$2.onResult(DefaultServer.java:283)
at com.mongodb.internal.async.ErrorHandlingResultCallback.onResult(ErrorHandlingResultCallback.java:48)
at com.mongodb.internal.connection.CommandProtocolImpl$1.onResult(CommandProtocolImpl.java:82)
at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection$2.onResult(DefaultConnectionPool.java:530)
at com.mongodb.internal.connection.UsageTrackingInternalConnection$2.onResult(UsageTrackingInternalConnection.java:142)
at com.mongodb.internal.async.ErrorHandlingResultCallback.onResult(ErrorHandlingResultCallback.java:48)
at com.mongodb.internal.connection.InternalStreamConnection$2$1.onResult(InternalStreamConnection.java:444)
at com.mongodb.internal.connection.InternalStreamConnection$2$1.onResult(InternalStreamConnection.java:439)
at com.mongodb.internal.connection.InternalStreamConnection$MessageHeaderCallback.onResult(InternalStreamConnection.java:703)
at com.mongodb.internal.connection.InternalStreamConnection$MessageHeaderCallback.onResult(InternalStreamConnection.java:686)
at com.mongodb.internal.connection.InternalStreamConnection$5.completed(InternalStreamConnection.java:581)
at com.mongodb.internal.connection.InternalStreamConnection$5.completed(InternalStreamConnection.java:578)
at com.mongodb.internal.connection.AsynchronousChannelStream$BasicCompletionHandler.completed(AsynchronousChannelStream.java:250)
at com.mongodb.internal.connection.AsynchronousChannelStream$BasicCompletionHandler.completed(AsynchronousChannelStream.java:233)
at com.mongodb.internal.connection.tlschannel.async.AsynchronousTlsChannel.lambda$read$4(AsynchronousTlsChannel.java:122)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.OutOfMemoryError: Java heap space

Comment by Sheila Doyle [ 07/May/21 ]

jeff.yemin I agree and have communicated this with the customer.

We will open a help ticket and upload the application logs with DEBUG-level logging covering the next occurrence of the issue. You may close this ticket, and thank you for your help.

Sheila

Comment by Sheila Doyle [ 06/May/21 ]

Hi jeff.yemin,

TRACE-level logging has been provided and the customer confirms that the logs are all in UTC. Please let me know if you need any more information.

Best regards,

Sheila

Comment by Sheila Doyle [ 06/May/21 ]

Hello jeff.yemin,

Attached are the latest logs provided by Jork from the client side (INFO-level logs). So far the issue has not returned. I will check the cluster resources at the next occurrence of the issue; Jork will provide a timestamp and is also uploading screenshots showing network traffic.

For the last two days I had a JMX profiler connected to the application with CPU profiling enabled, but nothing showed up.

With the exception on 05-03 07:33 I only see a small increase in network traffic, not even reaching 1 Mb/sec, whereas on 04-23 it was around 8 Mb/sec (by the read application). Looking at the MongoDB network metrics, this lines up with spikes on the cluster.

Jork will provide logs at DEBUG level covering the next occurrence; please let us know if you require anything else.

Best regards,

Sheila

Comment by Jeffrey Yemin [ 03/May/21 ]

Hi jork.zijlstra@nielsen.com

Can you re-run your tests with TRACE logging in the driver enabled and share the application logs?  See http://mongodb.github.io/mongo-java-driver/4.2/driver-reactive/reference/logging/ for details.  Probably best to share them on the Support ticket rather than here.
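
For anyone following along, a minimal sketch of one programmatic way to do this, assuming Logback is the SLF4J backend (an equivalent logger entry in logback.xml works just as well); the org.mongodb.driver logger prefix is the one described in the logging reference linked above:

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class EnableDriverTraceLogging {
    public static void main(String[] args) {
        // Raise every logger under the driver's "org.mongodb.driver" prefix to TRACE,
        // ideally before the MongoClient is created so connection setup is captured too.
        Logger driverLogger = (Logger) LoggerFactory.getLogger("org.mongodb.driver");
        driverLogger.setLevel(Level.TRACE);
    }
}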

Comment by Jork Zijlstra [ 03/May/21 ]

One thing I just realized is that I actually can't say yet whether the CPU starvation is the cause or the effect. The exceptions might also trigger higher CPU usage rather than the other way around.

Comment by Jork Zijlstra [ 03/May/21 ]

The issues from the 7-minute window this morning actually match up with CPU starvation again.

Comment by Jork Zijlstra [ 03/May/21 ]

Attaching the logs from the client-side issues this morning. We had a 7-minute window where these occurred, after which the application restarted at 07:39, which matches up with the "connection reset by peer" message on the mongo master.

Comment by Jork Zijlstra [ 03/May/21 ]

We have currently deployed the application with twice the CPU and are putting some load on it. We do see that it became more stable, since the application is no longer crashing, but the exceptions are still happening.

An additional exception I found was (2021-05-03T07:39:29.995Z):

com.mongodb.MongoInternalException: Unexpected reply message opCode 50361199
	at com.mongodb.internal.connection.ReplyHeader.<init>(ReplyHeader.java:93)
	at com.mongodb.internal.connection.ReplyHeader.<init>(ReplyHeader.java:56)
	at com.mongodb.internal.connection.InternalStreamConnection$MessageHeaderCallback$MessageCallback.onResult(InternalStreamConnection.java:741)
	at com.mongodb.internal.connection.InternalStreamConnection$MessageHeaderCallback$MessageCallback.onResult(InternalStreamConnection.java:711)
	at com.mongodb.internal.connection.InternalStreamConnection$5.completed(InternalStreamConnection.java:581)
	at com.mongodb.internal.connection.InternalStreamConnection$5.completed(InternalStreamConnection.java:578)
	at com.mongodb.internal.connection.AsynchronousChannelStream$BasicCompletionHandler.completed(AsynchronousChannelStream.java:250)
	at com.mongodb.internal.connection.AsynchronousChannelStream$BasicCompletionHandler.completed(AsynchronousChannelStream.java:233)
	at com.mongodb.internal.connection.tlschannel.async.AsynchronousTlsChannel.lambda$read$4(AsynchronousTlsChannel.java:122)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

I'm also still debugging everything and going through the logs.

Comment by Jork Zijlstra [ 03/May/21 ]

Hi Sheila,

Thanks for researching this further.

I'm wondering how you came to this: "We see that at this exact time a query triggered a scan of more than 12 million objects, which indicates that the workload is not optimised." I have attached a screenshot of the database, and as you can see this database has 5 collections, where the biggest one only holds 1328 records. That's far less than the 12 million records you are mentioning.

I suspect you searched a different stack, which indeed has an action collection holding 12 million records, rather than the stack we reported this issue for?

All our queries are finds by 2 IDs, something like {Id1: "", Id2: ""}, and I am sure that all our queries use the indexes that we created for those actions.

Regards, Jork

Comment by Sheila Doyle [ 30/Apr/21 ]

Adding the diagnostic data metrics for the timestamp of the incident, 11:54:24 on 23/04/21. We can see there was an operation that involved scanning 12 million objects.

Also, we see that 102 connections dropped at the same time.

Comment by Jork Zijlstra [ 30/Apr/21 ]

Hello @jeff.yemin,

I did a quick test, but it seems to behave the same. It's actually quite hard to simulate: you need to generate just enough traffic so that queries won't start timing out. I can't yet say whether using Netty is a fix or not.

Comment by Sheila Doyle [ 29/Apr/21 ]

jeff.yemin adding a comment from jork_zijlstra's observation for completeness:

Issues disappeared after I gave the application more resources and stress tested it. When scaling back again, the issues reappeared. I have given the application double the amount of CPU now. I can only think of 2 reasons why we need double the CPU resources:

  • The switch to Atlas uses a TLS connection, which requires encoding/decoding, and these are CPU-intensive operations.
  • The switch from Intel to AMD CPUs, where we have noticed that the AMD CPU can actually handle less load for the same amount of CPU resources than Intel.

Comment by Jork Zijlstra [ 29/Apr/21 ]

Hello @jeff.yemin,

I can try to see if I can spend some time on this next week. We used to use this NettyStreamFactory with the 2.8.0 driver, so it shouldn't be too big of a change.

As for support, we do have a ticket open: https://support.mongodb.com/case/00769901

Regards, Jork

Comment by Jeffrey Yemin [ 28/Apr/21 ]

jork.zijlstra@nielsen.com, could you try one experiment for me?  Try using Netty as the underlying transport and see if you can still reproduce the exception.  The code for that looks something like this:

import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.connection.netty.NettyStreamFactoryFactory;
import com.mongodb.reactivestreams.client.MongoClient;   // assuming the Reactive Streams driver,
import com.mongodb.reactivestreams.client.MongoClients;  // which the Scala driver wraps

NettyStreamFactoryFactory nettyStreamFactoryFactory = NettyStreamFactoryFactory.builder().build();
MongoClientSettings settings = MongoClientSettings.builder()
        .applyConnectionString(new ConnectionString("<connection string>"))
        .streamFactoryFactory(nettyStreamFactoryFactory)
        .build();
MongoClient client = MongoClients.create(settings);

You'll have to take a dependency on Netty if you don't have one already.

Thanks,

Jeff

Comment by Jeffrey Yemin [ 28/Apr/21 ]

Hi jork.zijlstra@nielsen.com,

Thanks for reporting this. What you're seeing is definitely unexpected, even in the face of client resource constraints. We will do some investigation and get back to you soon with next steps.

I also want to note some other resources to get this question answered more quickly:

  • Our MongoDB support portal, located at support.mongodb.com
  • Our MongoDB community portal, located here
  • If you are an Atlas customer, you can review your support options by clicking Support in the top menu bar of the Atlas UI

Let me know if you've already tried any of the above.

Thank you!
