[JAVA-3077] Java driver upgrade causing slowness to some calls Created: 08/Nov/18  Updated: 11/Sep/19  Resolved: 12/Nov/18

Status: Closed
Project: Java Driver
Component/s: Performance, Query Operations
Affects Version/s: 3.4.2
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Manan Shah Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
backported by JAVA-2138 Performance regression between OP_QUE... Closed
Duplicate
duplicates JAVA-2110 Performance regression between 2.13 a... Closed

 Description   

We are in process of upgrading java driver from version 2.14 to 3.4.2. The MongoDB server ver is at 3.2.19 and the goal is to upgrade server version to 3.4 once driver has been successfully upgraded. We are experiencing slowness to some calls on the new driver interacting with server version 3.2. Although I won't be able to link the app code, here's some comparison details between old and new driver. These are all findAll() queries.

2.14 Driver Query time
Run 1 291ms
Run 2 215ms
Run 3 106ms
Run 4 210ms
Run 5 182ms
Avg 277ms

 

3.4.2 Driver Query time
Run 1 599ms
Run 2 398ms
Run 3 218ms
Run 4 396ms
Run 5 294ms
Avg 491ms

We added more timing information around different calls and compared between the Old and new driver. Extracting some info here.

Old Driver:

 2018-11-08 01:46:23,083 DEBUG START:1541663167 [util.BatchMessageIterator] [main] Message query is: { "_id" : { "$gt" : { "$oid" : "5be3e89c498e8dc38e92c2af"} , "$lte" : { "$oid" : "5be3e9ce498e8dc38e92c2b0"}}}
2018-11-08 01:46:23,090 INFO  START:1541663167 [util.WallClockTimer] [main] Timing around first data.hasNext() call took 6 ms
2018-11-08 01:46:23,091 INFO  START:1541663167 [util.WallClockTimer] [main] Timing around data.size() call took 1 ms
2018-11-08 01:46:23,092 INFO  START:1541663167 [util.WallClockTimer] [main] Timing around data.next() call took 0 ms
2018-11-08 01:46:23,290 INFO  START:1541663167 [util.WallClockTimer] [main] Timing around third data.hasNext() call took 0 ms

New Driver:

 2018-11-08 00:12:55,195 DEBUG START:1541657547 [util.BatchMessageIterator] [main] Message query is: { "_id" : { "$gt" : { "$oid" : "5be3d24a593a9d2842725f20"} , "$lte" : { "$oid" : "5be3d3e6593a9d2842725f21"}}}
2018-11-08 00:12:55,381 DEBUG START:1541657547 [event.ThreadLocalCommandListener] [main] Command [find] succeeded and took 185 ms
2018-11-08 00:12:55,383 INFO  START:1541657547 [util.WallClockTimer] [main] Timing around first data.hasNext() call took 187 ms
2018-11-08 00:12:55,482 DEBUG START:1541657547 [event.ThreadLocalCommandListener] [main] Command [count] succeeded and took 1 ms
2018-11-08 00:12:55,483 INFO  START:1541657547 [util.WallClockTimer] [main] Timing around data.size() call took 99 ms

Looking at that data, we noticed:
1. For data.size() calls, old driver consistently takes ~1ms, and new driver takes from 8ms to 99ms
2. For data.hasNext() calls, seems like the first instance of hasNext() call with old driver takes from 1ms to 6ms, and new driver is taking from 5ms to 389ms. But the other hasNext() calls seems fine?
3. For data.next() calls, there's not much difference between old driver and new driver.

 

Looking for insight if you're aware of any bug in the new driver version that could possibly cause this behavior? Any help would be great for us to proceed with upgrading our mongoDB server versions across the board.



 Comments   
Comment by bfink13 [ 12/Nov/18 ]

Thanks Ross - I also just found JAVA-2138 which looks to be similar. Do you have any kind of timeline for the investigation?

Comment by Ross Lawley [ 12/Nov/18 ]

Hi bfink13,

My apologies, I was incorrect, it is lazy like in 3.x - so hasNext() will cause the first call to the database to get the initial batch.

There are two main differences between the driver versions that we believe account for the speed difference:

1) The differences between the wire protocol used between the versions and how MongoDB handles those paths: OP_QUERY v's OP_COMMAND.
2) The differences in handling BSON data. The 2.x version of the driver is much simpler but less flexible compared to the 3.x codec approach.

There is another ticket open for checking / analyzing the performance differences between the driver versions: JAVA-2110 - please watch that ticket as updates and follow up work will come from that ticket.

Ross

Comment by bfink13 [ 09/Nov/18 ]

Hey Ross, I appreciate the quick response. (edit: apparently I can't tag people)

The reason we're going to 3.4.2 is because we got about 80% done with it over a year ago but had to put it on pause and now are finally picking it back up again. We'll be going to 3.8 and beyond soon

Here's a snippet of the code in question:

final DBCursor data = collection.find(queryObj)
    .setReadPreference(ReadPreference.primaryPreferred())
    .sort(new BasicDBObject("_id", 1))
    .limit(batchSize);
 
final WallClockTimer firstHasNextTimer = new WallClockTimer("Timing around first data.hasNext() call");
if (!data.hasNext()) {
    return endOfData();
}
 
firstHasNextTimer.logElapsed();
 
final WallClockTimer countTimer = new WallClockTimer("Timing around data.size() call");
final List<MyData> messages = Lists.newArrayListWithCapacity(data.size());
countTimer.logElapsed();
 

You can see example values of "queryObj" in the ticket description.

When you say "with 2.x the initial results are populated in the constructor", can you elaborate? Looking at the DBCursor implementation in 2.14.3, it looks like it only sets fields on the DBCursor object and doesn't perform any other work. Regardless, we'll wrap some timing around the collection.find() call and see if there is any performance difference there.

Let me know if you need more info.

Comment by Ross Lawley [ 09/Nov/18 ]

Hi manan@indeed.com & bfink13,

Thanks for the ticket, out of interest why upgrade to driver version 3.4.2? Just to let you know the latest 3.4.x version is 3.4.3 and the latest Java driver release is the 3.9.0.

Could you provide some example code so that the tests can be replicated?

Looking at that data, we noticed:
1. For data.size() calls, old driver consistently takes ~1ms, and new driver takes from 8ms to 99ms

This is surprising as both issue a count command to the database. Its unclear as what is the cause of the extra time given the command listener reports the command took 1ms. There is some post processing of the result from the server but that shouldn't account for 98ms. I'd expect the cost of the post processing to be consistent. The test code may provide more insight to cause of the pause - it is possible that the call coincided with garbage collection.

2. For data.hasNext() calls, seems like the first instance of hasNext() call with old driver takes from 1ms to 6ms, and new driver is taking from 5ms to 389ms. But the other hasNext() calls seems fine?

There is a slight difference between the driver versions as to when the initial batch of results from the server is requested. With 2.x the initial results are populated in the constructor, in the 3.x they are populated in the first hasNext() call. The timing would need to include the creation of the DBCursor and the first call to DBCursor#hasNext to reflect the differences between the two driver versions.

3. For data.next() calls, there's not much difference between old driver and new driver.

After the driver has been constructed in 2.x or the first hasNext() call in 3.x the initial batch of results are available - so calls to next() should have similar / the same performance.

I hope that provides some insight.

Ross

Comment by bfink13 [ 08/Nov/18 ]

I'm a coworker of Manan's. If it wasn't already clear, the ThreadLocalCommandListener logs shown above are emitted from a simple implementation of the CommandListener interface. In the "new driver" example, you can see that count() took only 1ms from the driver's perspective, but the wrapped DBCursor#size call took 99ms.

Generated at Thu Feb 08 08:58:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.