[JAVA-3612] Overhead from Java driver during Linkbench  Created: 27/Jan/20  Updated: 28/Oct/23  Resolved: 06/Apr/20
| Status: | Closed |
| Project: | Java Driver |
| Component/s: | Performance |
| Affects Version/s: | 3.11.0, 3.12.0 |
| Fix Version/s: | 4.1.0 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Mark Callaghan (Inactive) | Assignee: | Ross Lawley |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | perfteam-wishlist |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Linux |
| Attachments: | snap.new.nps, snap.old.nps, snap.5m.new.nps, snap.5m.old.nps |
| Description |
I am working on a benchmark (Linkbench) for MongoDB. When I run the same workload for MongoDB 4.2.1 and MySQL 5.7.28 I see ~2.5X more CPU time in the Linkbench client for MongoDB than for MySQL. The client accounts for about 1/3 of total CPU time in the benchmark, so my hope is that this can be reduced. The client includes generic code, per-DBMS code (MongoDB, MySQL), and code from the per-DBMS driver. Most of the overhead is in the BSON decode and encode functions. I will try using a codec next, but reducing CPU overhead in the Java driver will still be useful for clients that don't use a codec. I attached CPU profiler output from VisualVM; unfortunately it has to be explored in the tool itself. I have taken screenshots from the profiler, but they are hard to read. I can provide screenshots if they would help, but exploring via VisualVM is the best bet.
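For reference, the codec approach mentioned above means registering an org.bson.codecs.Codec for the domain type so the driver decodes query results directly into it instead of into generic Document objects. A minimal sketch, assuming a hypothetical Link class whose id1 / link_type / time fields are illustrative Linkbench-style stand-ins, not names taken from this ticket:

    import org.bson.BsonReader;
    import org.bson.BsonType;
    import org.bson.BsonWriter;
    import org.bson.codecs.Codec;
    import org.bson.codecs.DecoderContext;
    import org.bson.codecs.EncoderContext;

    // Hypothetical domain type; field names are illustrative stand-ins.
    final class Link {
        long id1;
        long linkType;
        long time;
    }

    // Sketch of a hand-written codec: decode straight into Link instead of
    // building a generic Document per result row.
    final class LinkCodec implements Codec<Link> {
        @Override
        public Link decode(final BsonReader reader, final DecoderContext context) {
            Link link = new Link();
            reader.readStartDocument();
            while (reader.readBsonType() != BsonType.END_OF_DOCUMENT) {
                String name = reader.readName();
                if (name.equals("id1")) {
                    link.id1 = reader.readInt64();
                } else if (name.equals("link_type")) {
                    link.linkType = reader.readInt64();
                } else if (name.equals("time")) {
                    link.time = reader.readInt64();
                } else {
                    reader.skipValue();   // ignore fields we don't model, e.g. _id
                }
            }
            reader.readEndDocument();
            return link;
        }

        @Override
        public void encode(final BsonWriter writer, final Link value, final EncoderContext context) {
            writer.writeStartDocument();
            writer.writeInt64("id1", value.id1);
            writer.writeInt64("link_type", value.linkType);
            writer.writeInt64("time", value.time);
            writer.writeEndDocument();
        }

        @Override
        public Class<Link> getEncoderClass() {
            return Link.class;
        }
    }

The codec would then be registered in the client's codec registry (e.g. via CodecRegistries.fromRegistries) and the collection obtained with getCollection(name, Link.class).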
| Comments |
| Comment by Githook User [ 06/Apr/20 ] |
Author: {'name': 'Ross Lawley', 'email': 'ross.lawley@gmail.com', 'username': 'rozza'}
Message: Perf: ByteBufferBsonInput performance improvements
Improve readCStrings performance.
| Comment by Mark Callaghan (Inactive) [ 03/Mar/20 ] |
The improvement was significant from the perspective of the CPU profiler, but small from the perspective of the CPU consumption I measure from the Java client in my repro case. While reading code is always a risky way to judge, the code also appears to be more performant based on reading it, although my Java perf skills are not current. So I hope that the change gets committed.
| Comment by Ross Lawley [ 03/Mar/20 ] |
Hi mark.callaghan, before I close this ticket out can I ask for clarification: is it your recommendation to make those changes to the driver, or was the real result negligible? Ross
| Comment by Mark Callaghan (Inactive) [ 12/Feb/20 ] |
For the scope of this Jira ticket I am not asking for additional changes. But I am happy to discuss future perf projects and run tests to support the changes.
| Comment by Mark Callaghan (Inactive) [ 12/Feb/20 ] |
Uploaded snap.5m.new.nps and snap.5m.old.nps, which are CPU profiles where I wait 5 minutes before attaching. I added -XX:+PrintCompilation to the command line to get feedback on the JIT, and that output gets somewhat quiet after ~3 minutes, so I attached the CPU profiler after the test had been running for 5 minutes; this is the result. From the JIT output, the important methods had all reached "tier 4" optimization per this article.

I am still unable to show a benefit to Linkbench throughput from this change, but I can show a reduction in CPU for readCString. I will repeat the analysis I did before with a bit more detail. As above, A% / B% is the percentage of CPU time for old / new (old is the 3.12 driver, new is the 3.12 driver with your BSON changes). I use "your change made this better" to highlight where your BSON changes helped.

At the top level there is 63.2% / 53.2% in LinkBenchRequest.getLinkList and 35.4% / 46.5% in LinkBenchRequest.getLinkListTail. The split between getLinkList and getLinkListTail was closer to 50:50 in what I explained above. Both of these functions spend all of their time in LinkStoreMongoDb2.getLinkList. The first thing to note is that the queries in getLinkListTail return larger result sets, hence the big difference between the iterator and hasNext overheads.

For LinkBenchRequest.getLinkList, which is 63.2% / 53.2% of CPU time:

For LinkBenchRequest.getLinkListTail, which is 35.4% / 46.5% of CPU time:

Now I follow the LinkBenchRequest.getLinkList call chain. First up is the hasNext path, which was 12.2% / 11% of total CPU. Deep in the call chain is this:
Still on the LinkBenchRequest.getLinkList call chain. Next up is the MongoIterableImpl.iterator call chain, which is 49.2% / 39.4% of CPU time. The first thing I see is ~2.5% of CPU time (2.5 points of the 49.2% or 39.4%) in FindIterableImpl.asReadOperation via Operations.createFindOperation; of that, 2% is Filters$AndFilter.toBsonDocument and the rest is spread over many callees. Ignoring that, I can drill down from executeCommand to reach InternalStreamConnection.sendAndReceive, which is 40.7% / 31.7% (the top of this call chain is 49.2% / 39.4%). And there I see:
To keep this readable I will stop describing that call chain here; the CPU overheads are distributed over many places. Now I follow the LinkBenchRequest.getLinkListTail call chains, which are 35.4% / 46.5% of CPU time. I drill down to MongoBatchCursorAdapter.hasNext, which is 33.8% / 43.9% of CPU time, and then to InternalStreamConnection.receiveCommandMessageResponse, which is still 33.8% / 43.9% of CPU time. Eventually I reach CommandResultArrayCodec.decode, which is 33.3% / 42.9% of CPU time; it calls DocumentCodec.decode, which is 31.4% / 40.8% of CPU time, and in the children of DocumentCodec.decode I see:
One way to see the impact of your change is the ratio of time in readCString vs readByte above.
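To make those call chains concrete, here is a hedged sketch of the query shape they map to (the collection handle and the id1 / link_type / time field names are hypothetical Linkbench-style stand-ins). iterator() runs the initial find command (the MongoIterableImpl.iterator / sendAndReceive path), while hasNext() drives getMore batches and BSON decoding (the MongoBatchCursorAdapter.hasNext / DocumentCodec.decode path), which is why getLinkListTail's larger result sets shift time into hasNext:

    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoCursor;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Sorts;
    import org.bson.Document;

    // Hedged sketch of a getLinkList-style range query; names are assumptions.
    final class GetLinkListSketch {
        static void scan(MongoCollection<Document> links, long id1, long linkType) {
            try (MongoCursor<Document> cursor = links.find(
                            Filters.and(Filters.eq("id1", id1),
                                        Filters.eq("link_type", linkType)))
                    .sort(Sorts.descending("time"))
                    .iterator()) {
                while (cursor.hasNext()) {          // batches fetched and decoded here
                    Document link = cursor.next();  // each row decoded via DocumentCodec
                    // consume link ...
                }
            }
        }
    }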
| Comment by Jeffrey Yemin [ 12/Feb/20 ] |
Essentially you have to warm the JVM by running the code under test for "a while" before starting measurement.
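A minimal sketch of that methodology, assuming a hypothetical runOneTransaction unit of work (a real benchmark would typically use a harness such as JMH, which handles warmup iterations for you):

    // Minimal warmup-then-measure sketch; runOneTransaction is a hypothetical
    // stand-in for the Linkbench request under test.
    final class WarmupSketch {
        public static void main(String[] args) {
            long warmupEnd = System.nanoTime() + 5L * 60 * 1_000_000_000L;
            while (System.nanoTime() < warmupEnd) {   // let tiered compilation settle
                runOneTransaction();
            }
            long start = System.nanoTime();           // measure only after warmup
            for (int i = 0; i < 100_000; i++) {
                runOneTransaction();
            }
            System.out.println("measured ns: " + (System.nanoTime() - start));
        }

        private static void runOneTransaction() {
            // stand-in for the workload
        }
    }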
| Comment by Mark Callaghan (Inactive) [ 12/Feb/20 ] |
I am reading this now - https://www.infoq.com/articles/OpenJDK-HotSpot-What-the-JIT/
| Comment by Mark Callaghan (Inactive) [ 12/Feb/20 ] |
> Question about the CPU profiler: is it profiling the JVM in a scenario where the HotSpot compiler has a chance to kick in and do all its dynamic inlining optimizations?

jeff.yemin - that is an excellent question; I wondered about that too. I stopped doing serious Java development around 2000, so I am rusty on this topic. Right now I start the test, wait ~60 seconds, then attach the CPU profiler. I need to confirm what should be done to let the JVM do its magic before I collect profiles. I use Ubuntu 19.10 with:
| Comment by Mark Callaghan (Inactive) [ 12/Feb/20 ] |
I used VisualVM for the CPU profile and collected ~43 seconds of CPU time. All of the time is in the thread running Linkbench code, so the extra CPU isn't from background threads. 99% of the time is from two calls, which eventually end up spending all of their time in LinkStoreMongoDb2.getLinkList, so I will compare the CPU overheads from one of these calls with and without your change to the BSON jar ("new" is with your change, "old" is without). I attached the CPU profiler snapshots as snap.new.nps and snap.old.nps and try to describe them below. I don't do much Java development; these were created by VisualVM, which seems OK so far.

The good news is that if you search below for "your change made this better" you will see where your code might have removed ~20% of the CPU overhead. It is ~10% of the CPU in the case I show, but there are two callers down that call chain that are each about 50% of CPU (100% in total). I am still struggling to understand why I have yet to show a benefit from this change on transaction throughput.

I have exported the VisualVM snapshot files (*.nps format) for this, but will try to describe them here. I start with a function that is ~100% of total CPU time but is called from two places, so VisualVM doesn't collapse it and the max CPU times below will be ~50%.

The call stack below is 50.1% / 47.7% of total CPU for old / new. The interesting part of the call stack, with CPU times for old / new:

Drilling down from where I left off above, the children of BsonBinaryReader.readBsonType, which was 28.6% / 20.9%, are:

The children of DocumentCodec.readValue, which was 11.7% / 15.4%, are below. Note that the percentages for "new" on the RHS can be larger because of the CPU savings from your change above, as these are total CPU overheads, not CPU overhead per call:
| Comment by Jeffrey Yemin [ 12/Feb/20 ] |
mark.callaghan thanks for re-running these tests. Question about the CPU profiler: is it profiling the JVM in a scenario where the HotSpot compiler has a chance to kick in and do all its dynamic inlining optimizations?
| Comment by Mark Callaghan (Inactive) [ 12/Feb/20 ] |
Repeated the tests and was more careful about what I measured. From this I don't see a benefit. I am looking at CPU profiles now and will update this soon. Legend:
The tests are t1, t2, and t3: t1 has a mix of transactions, t2 is 91% range queries, and t3 is 99% range queries. The range query is the one your change should help, as it returns a large result set and decoding that result set was a CPU problem per the CPU profiler.
| Comment by Mark Callaghan (Inactive) [ 11/Feb/20 ] |
Did the manual step as explained above, then confirmed that the jar file I use for running my test has the changes by doing "jar xf" and looking at the checksum for each class file. With the change I see a ~4% reduction in CPU time. I hoped for more and will revisit this with a CPU profiler in a few days. I appreciate your prompt work on this, but officially I work part-time (and not today).
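(For anyone repeating that check: it amounts to something like jar xf bson-3.12.2-SNAPSHOT.jar into two directories followed by sha1sum org/bson/io/ByteBufferBsonInput.class in each; the jar and class file names here are illustrative.)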
| Comment by Mark Callaghan (Inactive) [ 11/Feb/20 ] |
Trying it now, and thanks for the tips on mvn. I haven't done much with Java so I struggle with some of the tool chains. I get this error from gradlew publishToMavenLocal. Can I just copy the jar there manually and run "mvn -o" so it doesn't clobber that jar file?

> Task :bson:signMavenJavaPublication FAILED
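(A plausible workaround, not confirmed in this thread: skip the signing task with ./gradlew publishToMavenLocal -x signMavenJavaPublication, since Gradle's -x flag excludes the named task and signing is only required when publishing to a remote repository.)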
| Comment by Ross Lawley [ 11/Feb/20 ] |
Hi mark.callaghan, I've created a branch with the proposed fixes: https://github.com/rozza/mongo-java-driver/tree/j3612.3.12

To create the 3.12.2-SNAPSHOT jar you can either run:
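Presumably this was along the lines of ./gradlew publishToMavenLocal from that branch (an assumption based on the gradlew publishToMavenLocal command referenced in the comment above), which installs the 3.12.2-SNAPSHOT bson and driver jars into the local ~/.m2 repository where an offline mvn build can resolve them.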
Ross
| Comment by Ross Lawley [ 28/Jan/20 ] |
Initial PR: https://github.com/rozza/mongo-java-driver/pull/357
| Comment by Ross Lawley [ 28/Jan/20 ] |
Thanks mark.callaghan - great insights. Hopefully we can improve upon those figures with the simple fixes identified. Ross
| Comment by Mark Callaghan (Inactive) [ 27/Jan/20 ] |
My suggestion is to call the equivalent of ensureAvailable and ensureOpen once per call to readUntilNullByte. Even with that change there will still be several calls to ensureOpen, just not one per byte.
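A minimal sketch of that suggestion (hypothetical code, not the actual patch): validate the stream once per readUntilNullByte call, then scan the backing ByteBuffer with absolute gets, so no per-byte checks remain:

    import java.nio.ByteBuffer;

    // Hypothetical sketch of the suggested fix: hoist ensureOpen() out of the
    // per-byte loop and scan the buffer directly for the null terminator.
    final class HoistedChecksSketch {
        private final ByteBuffer buffer;
        private boolean closed;

        HoistedChecksSketch(ByteBuffer buffer) {
            this.buffer = buffer;
        }

        void readUntilNullByte() {
            ensureOpen();                          // once per call, not per byte
            int limit = buffer.limit();
            for (int i = buffer.position(); i < limit; i++) {
                if (buffer.get(i) == 0) {          // absolute get: no position updates
                    buffer.position(i + 1);        // consume through the null byte
                    return;
                }
            }
            throw new IllegalStateException("string does not end with a null byte");
        }

        private void ensureOpen() {
            if (closed) {
                throw new IllegalStateException("stream is closed");
            }
        }
    }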
| Comment by Mark Callaghan (Inactive) [ 27/Jan/20 ] |
Looking at the CPU profile again, most CPU is on the decode side. One call on the client accounts for 60% of CPU. When I drill down into it, total CPU is still 52% at org.bson.codecs.DocumentCodec.decode. Below that:
For the children of readBsonType:
Children of readCString:
Also, there might be >= 10% of total CPU time from org.bson.io.ByteBufferBsonInput.ensureAvailable and ensureOpen. readUntilNullByte calls readByte in a loop, and readByte calls ensureAvailable and ensureOpen per byte, according to https://github.com/mongodb/mongo-java-driver/blob/master/bson/src/main/org/bson/io/ByteBufferBsonInput.java. That might be the easiest thing to change; a sketch of the pattern is below.
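For illustration, a simplified sketch of that per-byte pattern (not the driver's exact source; the names follow the methods mentioned above):

    import java.nio.ByteBuffer;

    // Simplified sketch, not the driver's exact source: every byte of every
    // C string pays for a stream-open check and a bounds check.
    final class PerByteInputSketch {
        private final ByteBuffer buffer;
        private boolean closed;

        PerByteInputSketch(ByteBuffer buffer) {
            this.buffer = buffer;
        }

        void readUntilNullByte() {
            while (readByte() != 0) {      // one validation round trip per byte
            }
        }

        byte readByte() {
            ensureAvailable(1);            // per byte: ensureOpen() + remaining()
            return buffer.get();
        }

        private void ensureAvailable(int bytesNeeded) {
            ensureOpen();
            if (buffer.remaining() < bytesNeeded) {
                throw new IllegalStateException("unexpected end of buffer");
            }
        }

        private void ensureOpen() {
            if (closed) {
                throw new IllegalStateException("stream is closed");
            }
        }
    }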