[JAVA-4122] Getting Exceptions when CPU starving the driver Created: 28/Apr/21 Updated: 27/Oct/23 Resolved: 24/May/21
| Status: | Closed |
| Project: | Java Driver |
| Component/s: | Scala |
| Affects Version/s: | 4.2.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Unknown |
| Reporter: | Jork Zijlstra | Assignee: | Jeffrey Yemin |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Case: | (copied to CRM) |
| Description |
Using the Scala driver (4.2.3), when we switched over to an Atlas cluster we started noticing some instabilities in our application. We were, for instance, getting
and
It seems that the driver throws these exceptions when it is starved for CPU. We are running our application on K8s with specific resource limits set. After some load testing, the exceptions seem to disappear once we double the CPU resources, which is only enough to handle 50% of the same traffic. Could it be that something is not thread safe? Why we need at least double the CPU resources, I don't know yet. |
| Comments |
| Comment by Backlog - Core Eng Program Management Team [ 24/May/21 ] |
There hasn't been any recent activity on this ticket, so we're resolving it. Thanks for reaching out! Please feel free to comment on this if you're able to provide more information.
| Comment by Jeffrey Yemin [ 08/May/21 ] |
I had a closer look at the CSV file containing the application logs, and noticed that there are many OutOfMemoryError exceptions being thrown as well. While this doesn't explain the other errors, it may be a contributing factor. First example:
| Comment by Sheila Doyle [ 07/May/21 ] |
jeff.yemin I agree and have communicated this with the customer.
We will open a help ticket and upload application logs at DEBUG level covering the next occurrence of the issue. You may close this ticket, and thank you for your help.
Sheila
| Comment by Sheila Doyle [ 06/May/21 ] |
Hi jeff.yemin,
TRACE level logging has been provided and the customer confirms that the logs are all in UTC. Please let me know if you need any more information.
Best regards,
Sheila
| Comment by Sheila Doyle [ 06/May/21 ] |
Hello jeff.yemin,
Attached are the latest client-side logs provided by Jork (INFO level). So far the issue has not returned. I will check the cluster resources at the next occurrence of the issue; Jork will provide a timestamp and will also upload screenshots showing network traffic. For the last two days I had a JMX profiler connected to the application with CPU profiling enabled, but nothing showed up. For the exception on 05-03 07:33 I only see a small increase in network traffic, not even reaching 1 Mb/sec, whereas on 04-23 it was around 8 Mb/sec (by the read application). Looking at the MongoDB network metrics, these line up with spikes on the cluster. Jork will provide logs at DEBUG level covering the next occurrence; please let us know if you require anything else.
Best regards,
Sheila
| Comment by Jeffrey Yemin [ 03/May/21 ] |
Can you re-run your tests with TRACE logging in the driver enabled and share the application logs? See http://mongodb.github.io/mongo-java-driver/4.2/driver-reactive/reference/logging/ for details. Probably best to share them on the Support ticket rather than here.
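For reference, a minimal sketch of one way to raise the driver logger to TRACE, assuming Logback is the application's SLF4J backend (an equivalent logger entry in logback.xml works as well; "org.mongodb.driver" is the driver's logger name prefix):
{code:scala}
import ch.qos.logback.classic.{Level, Logger}
import org.slf4j.LoggerFactory

// Assumes Logback is the SLF4J backend on the classpath.
// All driver loggers live under the "org.mongodb.driver" prefix.
LoggerFactory.getLogger("org.mongodb.driver")
  .asInstanceOf[Logger]
  .setLevel(Level.TRACE)
{code}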
| Comment by Jork Zijlstra [ 03/May/21 ] |
One thing I just realized is that I can't actually say yet whether the CPU starvation is the cause or the effect. The exceptions might also be triggering higher CPU usage rather than the other way around.
| Comment by Jork Zijlstra [ 03/May/21 ] |
The issues from the 7-minute window this morning are actually matching up with CPU starvation again.
| Comment by Jork Zijlstra [ 03/May/21 ] |
Attaching the logs from the client-side issues this morning. We had a 7-minute window in which these occurred, after which the application restarted at 07:39, which matches up with the "connection reset by peer" message on the mongo master.
| Comment by Jork Zijlstra [ 03/May/21 ] |
We have currently deployed the application with twice the CPU and are putting some load on it. We do see that it has become more stable, since the application is no longer crashing, but the exceptions are still happening. An additional exception I found was (2021-05-03T07:39:29.995Z):
I'm also still debugging everything / going through the logs.
| Comment by Jork Zijlstra [ 03/May/21 ] |
Hi Sheila, thanks for researching this further. I'm wondering how you came to this: "We see that at this exact time a query triggered a scan of more than 12 million objects which indicates that the workload is not optimised." I have attached a screenshot of the database, and as you can see this database has 5 collections, where the biggest one only holds 1328 records, far fewer than the 12 million records you are mentioning.
I suspect you looked at a different stack, which indeed has an action collection holding 12 million records, not the stack we reported this issue for? All our queries are finds by 2 ids, something like {Id1: "", Id2: ""}, and I am sure that all our queries use the indexes that we created for those actions. Regards, Jork
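For context, a minimal sketch of the kind of two-id lookup described above, with a compound index to support it (the URI, database, collection, and id values are illustrative placeholders):
{code:scala}
import scala.concurrent.Await
import scala.concurrent.duration._
import org.mongodb.scala._
import org.mongodb.scala.model.{Filters, Indexes}

// Placeholder URI, database, and collection names.
val collection = MongoClient("mongodb://localhost")
  .getDatabase("test")
  .getCollection("actions")

// Compound index supporting the two-id lookup.
Await.result(
  collection.createIndex(Indexes.compoundIndex(Indexes.ascending("Id1"), Indexes.ascending("Id2"))).toFuture(),
  10.seconds)

// Find by both ids, matching the {Id1: "", Id2: ""} shape described above.
val docs = Await.result(
  collection.find(Filters.and(Filters.equal("Id1", "someId1"), Filters.equal("Id2", "someId2"))).toFuture(),
  10.seconds)
docs.foreach(d => println(d.toJson()))
{code}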
| Comment by Sheila Doyle [ 30/Apr/21 ] |
Adding the diagnostic data metrics for the timestamp of the incident, 11:54:24 on 23/04/21. We can see there was an operation that involved scanning 12 million objects.
We also see that 102 connections dropped at the same time.
| Comment by Jork Zijlstra [ 30/Apr/21 ] |
Hello @jeff.yemin, I did a quick test but it seems to behave the same. It's actually quite hard to simulate: you need to generate just enough traffic so that queries don't start timing out. I can't yet say whether or not using Netty is a fix.
| Comment by Sheila Doyle [ 29/Apr/21 ] |
jeff.yemin adding jork_zijlstra's observation as a comment for completeness:
| Comment by Jork Zijlstra [ 29/Apr/21 ] |
Hello @jeff.yemin, I can try to see if I can spend some time on this next week. We used to use this NettyStreamFactory with the 2.8.0 driver, so it shouldn't be too big of a change. As for support, we do have a ticket open: https://support.mongodb.com/case/00769901
Regards, Jork
| Comment by Jeffrey Yemin [ 28/Apr/21 ] |
jork.zijlstra@nielsen.com, could you try one experiment for me? Try using Netty as the underlying transport and see if you can still reproduce the exception. The code for that looks something like this:
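Something along these lines (a sketch against the 4.2 API; the connection string is a placeholder):
{code:scala}
import com.mongodb.ConnectionString
import com.mongodb.connection.netty.NettyStreamFactoryFactory
import org.mongodb.scala.{MongoClient, MongoClientSettings}

// Build the client settings with the Netty-based stream factory instead of the default NIO transport.
val settings = MongoClientSettings.builder()
  .applyConnectionString(new ConnectionString("mongodb+srv://<user>:<password>@<cluster>/"))
  .streamFactoryFactory(NettyStreamFactoryFactory.builder().build())
  .build()

val mongoClient = MongoClient(settings)
{code}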
You'll have to take a dependency on Netty if you don't have one already. Thanks, Jeff
| Comment by Jeffrey Yemin [ 28/Apr/21 ] |
Thanks for reporting this. What you're seeing is definitely unexpected, even in the face of client resource constraints. We will do some investigation and get back to you soon with next steps. I also want to note some other resources to get this question answered more quickly:
Let me know if you've already tried any of the above. Thank you!