[JAVA-1825] 100% CPU with find request during a primary change Created: 20/May/15 Updated: 11/Sep/19 Resolved: 23/Jun/15 |
|
| Status: | Closed |
| Project: | Java Driver |
| Component/s: | None |
| Affects Version/s: | 3.0.1 |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Julien Buret | Assignee: | Unassigned |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Description |
|
On server side I have a replica set with one primary and one secondary. I trigger a step-down (ie I switch the primary and the secondary) and during this operation, the client send a find request using the async java driver. -> 100% CPU for ever on client side When profiling, all the CPU is wasted in com.mongodb.connection.AsynchronousSocketChannelStream$BasicCompletionHandle#completed with this kind of stacktrace : Thread-26@11189 daemon, prio=5, in group 'main', status: 'RUNNING' When no request is sent during the step-down, all is ok |
| Comments |
| Comment by Julien Buret [ 23/May/15 ] | ||||||||||||||||||||||||||||||
|
top says 100% CPU, and so does the visualvm monitor, as soon as I stepDown. If i kill the primary, I've got this :
... and 100% CPU Anyway if you don't reproduce, you don't reproduce. This issue is not blocking for me, and I will just reboot the instances when compose.io stepdown (which happens rarely) | ||||||||||||||||||||||||||||||
| Comment by Jeffrey Yemin [ 22/May/15 ] | ||||||||||||||||||||||||||||||
|
What I mean is that I started the program when there is no primary available to execute the query on. An rs.stepDown() is one way to do that, but just killing the primary is another. From what I can see from the small log sample above, the driver is doing the right thing here. It discovers the new primary on port 27018, and then discovers that the the server on port 27017 is now a secondary. As for the CPU usage, I actually don't see evidence of high CPU usage in the CSV file that you provided. The way I understand the data in the CSV file, the Time in column B is a percentage of total time, including wait time. The actual CPU usage is in column D, and is much lower. You can see that very clearly for the main thread starting in row 3: the time is quite high, but it's all ultimately attributed to Thread.sleep() and actual CPU usage is very low. And it's the same for AsynchronousSocketChannelStream$BasicCompletionHandler.completed: the total time is relatively high, as is the percentage, but CPU usage is very low, and this is because ultimately the method is waiting on I/O to complete and is not doing much real work on the CPU. If you'd like to verify the actual CPU usage of the Java driver, please try a command line tool like top. | ||||||||||||||||||||||||||||||
| Comment by Julien Buret [ 22/May/15 ] | ||||||||||||||||||||||||||||||
|
What do you mean by "run with no primary". With a rs.stepDown() I got :
but no timeout exception | ||||||||||||||||||||||||||||||
| Comment by Jeffrey Yemin [ 21/May/15 ] | ||||||||||||||||||||||||||||||
|
Can you try adding a shorter server selection timeout to ClusterSettings builder:
When I do that to your program, and run with no primary, it prints this out 1 second after each async call to find:
I'd like to confirm that we're at least seeing roughly the same behavior before looking at your profiling results. | ||||||||||||||||||||||||||||||
| Comment by Julien Buret [ 21/May/15 ] | ||||||||||||||||||||||||||||||
|
Yes it seems the request never times out.
The change of minHeartbeatFrequency (I tried 1, 3, 5, 10 and 30s) has no effect, sometimes I got this exception (but the cpu issue remains) :
I build a simple test case to reproduce, the result csv export of visualvm is attached :
| ||||||||||||||||||||||||||||||
| Comment by Jeffrey Yemin [ 21/May/15 ] | ||||||||||||||||||||||||||||||
|
Are you saying that the find request never times out? In my testing I generally see the following exception delivered to the callback after 30 seconds, if the primary doesn't come back within that time:
I also don't see anything close to 100% CPU usage when I step down the primary. What platform is your Java program executing on? What version of the JDK? And how are you measuring the CPU usage? To help get a better understanding of what's going on in your environment, please try setting higher values for ServerSettings.minHeartbeatFrequency, which defaults to 500 milliseconds, and report what CPU usage you're seeing for different values of that property For example, this will set it to 5 seconds:
|