[SERVER-18056] 2d nearSphere performance regression Created: 14/Apr/15  Updated: 25/Jan/17  Resolved: 07/Aug/15

Status: Closed
Project: Core Server
Component/s: Geo
Affects Version/s: 3.0.2
Fix Version/s: 3.1.7

Type: Bug
Priority: Major - P3
Reporter: Michael Kania
Assignee: Siyuan Zhou
Resolution: Done
Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 2d-regression-date-range.tar.gz    
Issue Links:
Depends
depends on SERVER-19039 geoNear scans the same index cells mu... Closed
Related
is related to SERVER-17929 Add full query support for $meta valu... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Set up mongo nodes running 2.6 and 3.0, then import the attached dataset

mongoimport -d test -c test_collection --jsonArray 2d-regression-date-range.json 

Create 2d index and _created_at index

db.test_collection.createIndex({ "location": "2d" })
db.test_collection.createIndex( { "_created_at": 1 } )

Compare number of scanned documents and query duration on 2.6 and 3.0

db.test_collection.find({ location: { $nearSphere: [ 106.6331, 10.7395 ] }, 
_created_at: {"$lte" : ISODate("2015-01-23T17:59:44Z"), "$gte" : ISODate("2015-01-23T01:59:44Z")}}).limit(100)
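The scan counts and duration can be read from the query's explain output. The following is a minimal sketch, not part of the original report: `buildQuery` and `toDate` are hypothetical helper names, and the fallback to `Date` is only there so the snippet loads outside the mongo shell. It assumes a 3.0 shell, where `explain("executionStats")` reports `executionTimeMillis`, `totalKeysExamined` and `totalDocsExamined`; on 2.6, `explain()` without an argument reports `millis`, `nscanned` and `nscannedObjects`.

```javascript
// Use the shell's ISODate when available; fall back to Date elsewhere
// (an assumption of this sketch, not part of the original steps).
var toDate = (typeof ISODate === "function")
  ? ISODate
  : function (s) { return new Date(s); };

// Hypothetical helper: builds the filter used in the reproduction query.
function buildQuery() {
  return {
    location: { $nearSphere: [106.6331, 10.7395] },
    _created_at: {
      $lte: toDate("2015-01-23T17:59:44Z"),
      $gte: toDate("2015-01-23T01:59:44Z")
    }
  };
}

// Runs only inside a mongo shell, where the global `db` exists.
if (typeof db !== "undefined") {
  var cursor = db.test_collection.find(buildQuery()).limit(100);
  // 3.0: counters are reported under executionStats.
  // On 2.6, call cursor.explain() with no argument instead.
  printjson(cursor.explain("executionStats"));
}
```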

Sprint: RPL 6 07/17/15, RPL 7 08/10/15, RPL 8 08/31/15
Participants:

 Description   

Querying by coordinates alone appears to be much better thanks to SERVER-17469. However, after more testing I noticed that if I include a date range in the query, there is a significant performance regression between 2.6 and 3.0. I've attached a second export (2d-regression-date-range.tar.gz) that is the same as the previous export but with an additional _created_at field.

2.6
millis: 81
nscanned: 35252
nscannedObjects: 20345
3.0
executionTimeMillis: 2726
totalKeysExamined: 1037671
totalDocsExamined: 744461
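
To put the two sets of counters side by side, the ratios can be worked out directly. This is only arithmetic on the numbers reported above; the variable and function names are illustrative.

```javascript
// Counters copied from the 2.6 and 3.0 explain outputs above.
var v26 = { millis: 81, keys: 35252, docs: 20345 };
var v30 = { millis: 2726, keys: 1037671, docs: 744461 };

// Ratio rounded to one decimal place.
function ratio(after, before) {
  return Math.round((after / before) * 10) / 10;
}

var slowdown = {
  time: ratio(v30.millis, v26.millis), // execution time
  keys: ratio(v30.keys, v26.keys),     // index keys scanned
  docs: ratio(v30.docs, v26.docs)      // documents fetched
};
// slowdown is { time: 33.7, keys: 29.4, docs: 36.6 }
```

In other words, 3.0 takes roughly 34 times as long and examines roughly 29 times as many index keys and 37 times as many documents for the same query.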



 Comments   
Comment by Siyuan Zhou [ 07/Aug/15 ]

dynamike@fb.com, I believe this issue has been fixed in 3.1.6 and I am going to resolve it. Feel free to reopen the ticket if the problem continues.

Thanks,
Siyuan

Comment by Siyuan Zhou [ 28/Jul/15 ]

dynamike, it seems like 3.1.6 works well with rocksdb on our continuous integration system.

Link to binaries can be found in the compile task.

Comment by Michael Kania [ 27/Jul/15 ]

Looks great Siyuan. We will try testing 3.1.6 as soon as we can. We are using the rocksdb storage engine, and I believe it's not compatible with 3.1 yet.

Comment by Siyuan Zhou [ 24/Jul/15 ]

Hi dynamike and asya,

We introduced several geo performance improvements in the latest development release, 3.1.6. We were able to improve the performance of the 2dsphere index by 20x on the sample query.

3.1.6 - 2dsphere V2
"executionTimeMillis" : 1875,
"totalKeysExamined" : 24335,
"totalDocsExamined" : 41848,
 
After reindex
3.1.6 - 2dsphere V3
"executionTimeMillis" : 94,
"totalKeysExamined" : 21676,
"totalDocsExamined" : 38176,
 
Compared to 2d
3.1.6 - 2d
"executionTimeMillis" : 359,
"totalKeysExamined" : 95671,
"totalDocsExamined" : 112968,

I would appreciate it if you could give it a try in your testing environment. 3.1.6 can be found on the download page.

3.1.6 introduces version 3 of the 2dsphere index. Older index versions are still supported, but existing 2dsphere indexes must be rebuilt to get the full benefit.

Comment by Ramon Fernandez Marina [ 28/May/15 ]

andrey.hohutkin@gmail.com, there have been some design discussions internally, but this ticket has not been scheduled yet, so we're not able to provide a time estimate. We'll update the ticket with any further news, so stay tuned.

Regards,
Ramón.

Comment by Andrey Hohutkin [ 28/May/15 ]

Hi siyuan.zhou@10gen.com!

I already opened an issue and got the answer here: SERVER-18426.

I just want to know when I should expect a fix for the issue.

Comment by Siyuan Zhou [ 26/May/15 ]

Hi andrey.hohutkin@gmail.com,

Could you please tell us more about your use case and the performance problem you are facing? An example dataset and slow queries that reproduce the problem would be very useful for us in designing and testing geo features.

Thanks,
Siyuan

Comment by Andrey Hohutkin [ 26/May/15 ]

From my perspective, this issue causes serious performance problems. In my case, for example, we have a live product that is experiencing a big slowdown because of it.
Resolving this issue ASAP is not just a desirable feature for me, as it is described; it is a must for the product to keep growing. Please decide when you are planning to fix it.
Until it is fixed, I am struggling to grow our user count.

Comment by Asya Kamsky [ 04/May/15 ]

Yes, that's correct, MongoDB calculates distances for $nearSphere using spherical geometry.
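
As an illustration of the difference, the sketch below compares a flat-plane distance on raw [longitude, latitude] pairs with a great-circle distance. It is not MongoDB's implementation; the function names and the mean Earth radius of 6371 km are assumptions made for this example.

```javascript
var EARTH_RADIUS_KM = 6371; // mean Earth radius, assumed for this sketch

function toRad(deg) {
  return deg * Math.PI / 180;
}

// Flat-plane distance in degrees, treating [lon, lat] as x/y coordinates.
function planarDegrees(a, b) {
  var dx = a[0] - b[0];
  var dy = a[1] - b[1];
  return Math.sqrt(dx * dx + dy * dy);
}

// Great-circle distance in kilometers, using the haversine formula.
function haversineKm(a, b) {
  var dLat = toRad(b[1] - a[1]);
  var dLon = toRad(b[0] - a[0]);
  var h = Math.pow(Math.sin(dLat / 2), 2) +
    Math.cos(toRad(a[1])) * Math.cos(toRad(b[1])) *
    Math.pow(Math.sin(dLon / 2), 2);
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Two pairs of points, each one degree of longitude apart.
var atEquator = [[0, 0], [1, 0]];
var at60North = [[0, 60], [1, 60]];
```

On a flat plane both pairs are exactly one degree apart, but on the sphere the pair at 60 degrees north is only about half as far apart as the pair on the equator, so the two metrics can rank candidate points differently.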

Comment by Michael Kania [ 01/May/15 ]

Asya,
What's the functional difference between $near and $nearSphere? Is it just that $near calculates results on a 2d plane, whereas $nearSphere includes a curvature calculation?

Comment by Asya Kamsky [ 01/May/15 ]

dynamike

There is another workaround available to you to avoid the worse performance - since these are 2d indexes, you can use $near rather than $nearSphere and you will get significantly better performance - a quick test on my laptop on your dataset showed $near against 2d index performing almost 10x as fast as $nearSphere (for reasons already given by siyuan.zhou@10gen.com).

Asya

Comment by Siyuan Zhou [ 23/Apr/15 ]

Hi dynamike, I totally understand the difficulty of building new indexes. The behavior of the 2d index changed in a sizable code refactoring under SERVER-5800 to take advantage of the new query framework, because the old code didn't fit into it very well. The refactoring gives us some important advantages, including:

  • Large result set, no default limit of 100 documents
  • Skip and limit
  • Enable yielding

For this particular issue, the refactoring changed the way the 2d index is scanned, which is the reason for the performance drop, although we've seen performance gains in some test cases. I believe there is some room to improve geo near performance, but the work is not scheduled yet. I've also seen a slight performance drop for 2dsphere between 2.6 and 3.0. From profiling, I cannot find anything suspicious, since the 2dsphere index code hasn't changed much. Our query team is working on improving general query performance, so geo near will also benefit from that. If we decide to keep digging into geo near performance, I'll update this ticket.

igor - forcing geo near query to use geo index is not new behavior and it's not the root cause of this issue. The major reason is that $near/$nearSphere query operators have no real query planner support and poor semantics. Please track SERVER-17929 for discussion about the alternatives.

Thanks,
Siyuan

Comment by Igor Canadi [ 16/Apr/15 ]

> Geo near query is forced to use geo index, even though the index on _created_at seems more selective.

Is this new behavior in 3.0 and the reason why these queries are slow? What is the underlying reason that geo queries are forced to use geo index?

Comment by Michael Kania [ 15/Apr/15 ]

I get that adding compound indexes will make performance better. Pushing out a change to add compound indexes in our environment is extremely difficult, because we manage hundreds of thousands of indexes. I still don't understand why performance is much faster in 2.6. Even if we switch to 2dsphere indexes 2.6 is faster than 3.0.

Comment by Siyuan Zhou [ 15/Apr/15 ]

Hi dynamike,

I am able to reproduce this issue with your dataset. The given query takes 2723 ms. Geo near query is forced to use geo index, even though the index on _created_at seems more selective. Also, geo near query is relatively slow when searching a large area of high density, which is a known issue. I don't think there is any low hanging fruit, like SERVER-17469, to improve its performance. I'll keep this issue on my radar when tuning the performance in the future.

That said, we have some workarounds for this case.

2dsphere index

As explained in SERVER-17469, a 2dsphere index is more efficient because a 2d index has to compensate for the map projection. In this case, it does give better performance.

"executionTimeMillis" : 368,
"totalKeysExamined" : 153868,
"totalDocsExamined" : 69543,

Compound index: 2d + _created_at

A compound index avoids fetching documents that don't match _created_at, so totalDocsExamined drops dramatically.

"executionTimeMillis" : 168,
"totalKeysExamined" : 122163,
"totalDocsExamined" : 606,

Compound index: 2dsphere + _created_at

Combining both of the above gives the best result.

"executionTimeMillis" : 8,
"totalKeysExamined" : 1510,
"totalDocsExamined" : 428,
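
The three workarounds above correspond to index definitions along the following lines. This is a sketch under assumptions: the collection and field names are taken from the reproduction steps, `indexSpecs` and `applyWorkaround` are hypothetical names, and the field order in the 2dsphere compound spec is a guess, since the comment does not show the exact definitions. For a compound 2d index, the 2d field is listed first.

```javascript
// Index key specs for the three workarounds (names are illustrative).
var indexSpecs = {
  sphere: { location: "2dsphere" },
  flatCompound: { location: "2d", _created_at: 1 },
  sphereCompound: { location: "2dsphere", _created_at: 1 }
};

// Hypothetical helper: creates one of the indexes above.
// It only does work inside a mongo shell, where the global `db` exists.
function applyWorkaround(name) {
  var spec = indexSpecs[name];
  if (!spec) {
    throw new Error("unknown workaround: " + name);
  }
  if (typeof db !== "undefined") {
    db.test_collection.createIndex(spec);
  }
  return spec;
}
```

When comparing the workarounds, it is probably simplest to keep only one geo index on the collection at a time, for example by dropping the previous one before calling `applyWorkaround("sphereCompound")`, so that each measurement reflects a single index.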

Generated at Thu Feb 08 03:46:25 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.