[SERVER-26974] Poor 2dsphere / $near performance and inconsistent object scanning Created: 09/Nov/16  Updated: 19/May/17  Resolved: 13/Dec/16

Status: Closed
Project: Core Server
Component/s: Geo, Querying
Affects Version/s: 3.2.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Fabien Allanic Assignee: David Storch
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-26518 Include density estimator in geoNear ... Backlog
related to SERVER-18426 $geoNear expands aggressively if the ... Backlog
Operating System: ALL
Sprint: Query 2017-01-23
Participants:

 Description   

It looks like the $near query performance is inconsistent based on the data set and query parameter.

It is very easy to reproduce, I created a test database using the restaurants dataset from the $near docs

https://docs.mongodb.com/v3.2/tutorial/geospatial-tutorial/
https://raw.githubusercontent.com/mongodb/docs-assets/geospatial/restaurants.json

I imported this dataset in a collection, and created an index on the `location` attribute.

In a mongo shell if I run this query with explain

//near mexico
db.restaurants.find({ location: { $near: { $geometry: { type: "Point", coordinates: [ -109.16015625, 17.5602465032949 ] } } }}).limit(1).explain("executionStats")

this is the result I get

"executionStats" : {
        "executionSuccess" : true, 
        "nReturned" : NumberInt(1), 
        "executionTimeMillis" : NumberInt(203), 
        "totalKeysExamined" : NumberInt(25344), 
        "totalDocsExamined" : NumberInt(25340), 

if I run this query

//near Denver CO
db.restaurants.find({ location: { $near: { $geometry: { type: "Point", coordinates: [ -104.9955372, 39.7642543 ] } } }}).limit(1).explain("executionStats")

then I get

"executionStats" : {
        "executionSuccess" : true, 
        "nReturned" : NumberInt(1), 
        "executionTimeMillis" : NumberInt(2), 
        "totalKeysExamined" : NumberInt(7), 
        "totalDocsExamined" : NumberInt(4), 

Both queries are identical with a limit set to 1 except for the coordinates in the query, but one scans all the objects and takes 200ms (if far from the locations in the dataset), and the other one scans just a few objects and takes 2ms (if within the US).

Am I doing something wrong?

I was able to witness this on both WiredTiger and MMAP with MongoDB 3.2.10
Not sure if it happens with earlier versions



 Comments   
Comment by Fabien Allanic [ 14/Dec/16 ]

Thanks for the workaround.
I personally believe this work should be handled by the database itself, not application level workarounds.

The problem must be the way the data itself is indexed, but you guys are the experts, I have never built a database.
Just being honest, I'll likely be looking at Mongo alternatives for my next projects with geoqueries, I can't believe all databases require to guess where the points are before sending a query or not.

That being said, I appreciate the follow up instead of letting this issue hang forever

Comment by David Storch [ 13/Dec/16 ]

The MongoDB Query Team has reviewed this request and will be closing as Works as Designed. Although the system could undoubtedly be faster for queries like this (things can always be better and faster!), it is fundamentally the case that searching outwards for the nearest matching data point from a faraway search point is a more expensive operation than searching for a nearby point. The only option I can think of offhand to optimize this use case would be to check whether any data exists within a very large starting radius upfront, without first looking at successively expanding annuli until data is found. This, however, would come at the expense of other queries, since the database would be performing extra work in the common case that the query point is close to the data points.

Note that checking for data upfront could be a useful workaround at the application level. Suppose you have a query point p which the application suspects could be far away on the globe from any data points. The application could first issue a $centerSphere query centered at p with a large radius and with a limit of 1. If the database indicates that no data is found, then you can attach this large radius as a $minDistance to the actual $near query. Otherwise you issue the $near query with no $minDistance.

Comment by Fabien Allanic [ 17/Nov/16 ]

I hope what I am explaining makes sense.

Thanks for sharing this blog post https://emptysqua.re/blog/paging-geo-mongodb/ but it's not going to help us unfortunately.
Our problem is we cannot predict where the user will be located, so where the query center will be compared to the data set.
If you have another idea for a workaround, I'd love to hear your suggestions.

I also hope somebody can fix it at the core of the database, I strongly believe this is a very common use case, it should not require people to notice the bad performance of the index and then use a workaround.

Thanks in advance!

Comment by Fabien Allanic [ 15/Nov/16 ]

Thanks for your answer.

Generally, the distribution of query center tends to be similar with the data distribution.

Correct, this is generally the case, but not always.
One of our customer has locations in the US only, but on their website they get visits from Mexico which trigger slow queries, this is what I tried to reproduce in my example in this ticket.
This is a pretty common use case I imagine.

Comment by Kelsey Schubert [ 15/Nov/16 ]

Hi fallanic,

This behavior is not documented, and I would be happy to open a DOCS ticket on your behalf. However, I'd like to gather a bit more information about your use case first. Generally, the distribution of query center tends to be similar with the data distribution. Would you please describe your use case? And why this assumption doesn't hold?

Thank you,
Thomas

Comment by Fabien Allanic [ 14/Nov/16 ]

HI Thomas,

thanks for the answer, is this behavior explained anywhere in the official docs (i.e. index not working when querying a bit too far from the points in the dataset)?

If not it should be explained somewhere because this is kind of a big deal (and would save developers a lot of hours debugging a non-issue on their end).

Thanks for linking my ticket to the other one, I hope some optimizations for these queries will be implemented in the future.

Comment by Kelsey Schubert [ 14/Nov/16 ]

Hi fallanic,

Thank you for the detailed report. This is expected behavior as the density estimator increases the search radius until points are inside. Once points are identified inside of the radius, they must still be sorted before the first can be returned.

When the selected coordinate is close to any data points the search annuli is small and only a few records need to be sorted. However, if the selected coordinate if far from any data points each successive search annuli is larger and the annuli that eventually contains the records will have a great deal more that need to be sorted.

We're still investigating whether there are optimizations that can be made to our codebase to improve performance in this case, and I've linked this ticket to SERVER-26518, which would show this behavior in the explain output.

I'd also recommend taking a look at this blog post by a MongoDB employee which solves a related issue by setting minDistance on queries from the app layer.

Kind regards,
Thomas

Comment by Fabien Allanic [ 14/Nov/16 ]

Hi,

any update regarding that issue?

Thanks.

Generated at Thu Feb 08 04:13:46 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.