[SERVER-13568] Near search using find() with 2DSphere index is very slow vs. using a 2D index Created: 13/Apr/14  Updated: 09/Jul/16  Resolved: 17/Sep/15

Status: Closed
Project: Core Server
Component/s: Geo
Affects Version/s: 2.4.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Abraham Lopez Assignee: Unassigned
Resolution: Duplicate Votes: 4
Labels: geoNear
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Rackspace Performance cloud server with 4 GB RAM, 2 vCPUs and high-speed SSD data disk.


Issue Links:
Depends
Duplicate
duplicates SERVER-19039 geoNear scans the same index cells mu... Closed
Related
Operating System: Linux
Steps To Reproduce:

see below.

Participants:

 Description   

I have a database with over 3 million documents.

When running a find() on a GeoJSON Point field with a 2DSphere index the query is very slow (12,000 ms), while running the same find() using a 2D index is very fast (under 1 ms).

Steps to reproduce:

1. Create a collection named "objects" with more than 1 million documents.

2. Use the following simple schema:

{
      "id": 1,
      "location": {
            "type": "Point",
            "coordinates": [-118.491356, 34.02444]
      }
}
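
The original data set is private, so as a rough illustration (the document count and the degree of clustering are assumptions based on the discussion later in this ticket), test data could be generated in the mongo shell along these lines:

// Illustrative generator: one million GeoJSON points clustered around the
// query location; inserts are batched to keep the loop reasonably fast.
var batch = [];
for (var i = 1; i <= 1000000; i++) {
	batch.push({
		id: i,
		location: {
			type: "Point",
			coordinates: [
				-118.491356 + (Math.random() - 0.5) * 0.02,
				34.02444 + (Math.random() - 0.5) * 0.02
			]
		}
	});
	if (batch.length === 10000) {
		db.objects.insert(batch);   // array form performs a bulk insert
		batch = [];
	}
}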

3. Create a 2DSphere index on the location field:

db.objects.ensureIndex({ location: "2dsphere" });

4. Create a 2D index on the location.coordinates field:

db.objects.ensureIndex({ "location.coordinates": "2d" });
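
To confirm that both indexes were created, you can list them with the standard getIndexes() helper:

db.objects.getIndexes();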

5. Run this 2DSphere search query on the mongo client:

db.objects.find({
	"location": {
		"$near": {
			"$maxDistance": 10000,
			"$geometry": {
				"type": "Point",
				"coordinates": [-118.491356, 34.02444]
			}
		}
	}
}).limit(1000).explain();

You'll notice a very high number of scanned index entries (nscanned) and a very long response time, even though the S2NearCursor is being used. Here's the output I get from explain:

{
	"cursor" : "S2NearCursor",
	"isMultiKey" : true,
	"n" : 1000,
	"nscannedObjects" : 1000,
	"nscanned" : 1842767,
	"nscannedObjectsAllPlans" : 1000,
	"nscannedAllPlans" : 1842767,
	"scanAndOrder" : false,
	"indexOnly" : false,
	"nYields" : 7,
	"nChunkSkips" : 0,
	"millis" : 12565,
	"indexBounds" : {
		
	},
	"nscanned" : 1842767,
	"matchTested" : NumberLong(1695623),
	"geoMatchTested" : NumberLong(1695623),
	"numShells" : NumberLong(1),
	"keyGeoSkip" : NumberLong(147144),
	"returnSkip" : NumberLong(3577),
	"btreeDups" : NumberLong(0),
	"inAnnulusTested" : NumberLong(1695623),
	"server" : "mongoserver:27017"
}

6. Run this 2D search query on the mongo client:

db.objects.find({
	"location.coordinates": {
		"$near": [-118.491356, 34.02444],
		"$maxDistance": 10000
	}
}).limit(1000).explain();

Now you'll get a very fast response time. Here's the output I get from explain:

{
	"cursor" : "GeoSearchCursor",
	"isMultiKey" : false,
	"n" : 1000,
	"nscannedObjects" : 1000,
	"nscanned" : 1000,
	"nscannedObjectsAllPlans" : 1000,
	"nscannedAllPlans" : 1000,
	"scanAndOrder" : false,
	"indexOnly" : false,
	"nYields" : 0,
	"nChunkSkips" : 0,
	"millis" : 0,
	"indexBounds" : {
		
	},
	"server" : "mongoserver:27017"
}

7. You can use the geoNear command instead of $near when searching on the 2DSphere-indexed field, and you'll get the same huge response time.
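
For reference, a sketch of the equivalent geoNear command against the 2dsphere index (num and maxDistance mirror the find() above; spherical: true is required for GeoJSON points):

db.runCommand({
	geoNear: "objects",
	near: { type: "Point", coordinates: [-118.491356, 34.02444] },
	spherical: true,
	maxDistance: 10000,
	num: 1000
});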



 Comments   
Comment by Siyuan Zhou [ 17/Sep/15 ]

This issue has been fixed by SERVER-19039 in 3.1.6, so I am closing this ticket as a duplicate. Please open a new ticket if there are further issues.

Thanks,
Siyuan

Comment by Abraham Lopez [ 26/Jul/15 ]

Awesome, I will install and try the new version as soon as I have the chance (I'm currently very busy) and report back with the results.

Comment by Siyuan Zhou [ 24/Jul/15 ]

Hi aplimovil,

We introduced several geo performance improvements in the latest development release 3.1.6. From the symptoms, I feel like this issue should be fixed by the recent changes. I would appreciate it if you could give it a try in your testing environment. 3.1.6 can be found on the download page.

3.1.6 introduces version 3 of the 2dsphere index. Old index versions are still supported, but reindexing the 2dsphere index is necessary to get the most of the benefits.
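
A sketch of that reindex step, using the index key from the original report (on 3.1.6 a plain rebuild should already default to version 3; the 2dsphereIndexVersion option is shown only to make the version explicit):

// Drop the old 2dsphere index and rebuild it at index version 3.
db.objects.dropIndex({ location: "2dsphere" });
db.objects.createIndex({ location: "2dsphere" }, { "2dsphereIndexVersion": 3 });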

Comment by Siyuan Zhou [ 25/Sep/14 ]

If the stored data consists of GeoJSON points, then SERVER-15204 is not related to the original issue.

Comment by Abraham Lopez [ 25/Sep/14 ]

Unfortunately the data I used is privately owned by a client of mine, so I cannot share it.

So SERVER-15204 is not related at all to the original issue reported in this ticket, then?

Comment by Siyuan Zhou [ 24/Sep/14 ]

aplimovil, thanks for your update.

I believe SERVER-15204 is different from this issue, as I commented on that ticket. I am very interested in your data distribution, since the near query depends heavily on the data distribution in some extreme cases. A near query searches an initial small shell around the given point and then expands the search area adaptively. I suspect our initial shell is way too big for this use case, given that 1842767 documents were scanned just to return 1000 results. We may tune the default settings or expose some parameters to users. I would appreciate it if we could get real-world sample data from you to understand this problem better, for example the 1842767 documents in question. From your previous comment, I guess OpenStreetMap might be another good dataset.

Thanks,
Siyuan

Comment by Abraham Lopez [ 24/Sep/14 ]

Thanks Siyuan. I've added a comment to that ticket so you remember to let us know through this ticket when it's tackled, so we can test and confirm that the improvement fixes the performance issue originally reported here.

Comment by Siyuan Zhou [ 23/Sep/14 ]

heasleyb, thanks a lot for your sample dataset. I am able to reproduce the slow queries with $geoIntersects. It turns out that parsing a geometry takes most of the time, including the polygon self-intersection test among other sanity checks and validations. This issue has been filed as SERVER-15204. Please track that ticket for updates.

This ticket is for the performance of $near/$nearSphere; if you have any performance issues with near search, feel free to update this ticket. Thanks again!
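
For reference, the $nearSphere form of the original query against the 2dsphere index, with the same point and distance as in the description:

db.objects.find({
	location: {
		$nearSphere: {
			$geometry: { type: "Point", coordinates: [-118.491356, 34.02444] },
			$maxDistance: 10000
		}
	}
}).limit(1000);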

Comment by Brian Heasley [ 12/Sep/14 ]

We are using MongoDB v2.6.3. We're using the Maponics school data set (http://www.maponics.com/products/gis-map-data/school-boundaries/overview). I could give you the slow queries we are using but I don't believe we could provide the actual data due to licensing concerns without more discussion. If you wanted to go that route feel free to email me direct and we can see if we can work it out.

I appreciate you looking into this!

Comment by Siyuan Zhou [ 12/Sep/14 ]

aplimovil, ldsenow, and heasleyb, thanks for your feedback.

I am working on geo performance and looking into this problem. Could you please send us a real sample dataset and the slow queries? You can attach the dataset to this ticket or just give us a link. If we can reproduce the exact same problem on our side, it will be very helpful for understanding this issue and running CPU profiling. aplimovil's original comments and thomasr's reproduction are a good starting point, but we'd love to have more real data sets and queries.

Besides, which version are you using?

Thanks,
Siyuan

Comment by Brian Heasley [ 12/Sep/14 ]

We've had the same issue (very slow 2dSphere index queries) with Maponics school data, specifically in NYC where the density is high. Is there any update from the Geo engineer on whether this is a legitimate bug that might be addressed?

Comment by ldsenow [ 28/Jun/14 ]

Hi Abraham,

I am having exactly the same issue. Thanks to Thomas for pointing out the problem. The performance is not acceptable, and I am forced to create an old 2d index. I hope they can patch it in 2.6.4, not 2.8.

Comment by Abraham Lopez [ 30/Apr/14 ]

Hi Thomas,

Yes, that's exactly my case. Our database has millions of records (locations) whose coordinates are very close together: it is comprised of thousands of renders of physical objects (buildings, roofs, trees, streets, etc.) for each major city in the USA, so the points in my collection are indeed very close together. It seems this is a bug in the 2DSphere index when points in the collection are very close to each other.

As for radians, thanks for the tip; I was already aware of it.

Comment by Thomas Rueckstiess [ 30/Apr/14 ]

Hi Abraham,

I spent some time trying to reproduce the issue today. I finally saw a result similar to yours (2dsphere very slow compared to the 2d index), but only under a certain data distribution. Specifically, when all the points in the collection were really close together (in my case, uniformly distributed around [-118.50, 34.02] with +/- 0.01 in each direction), I got a large discrepancy in performance: 42 seconds (2dsphere) vs. 12 milliseconds (2d). The commands and outputs are below.

mongo
MongoDB shell version: 2.4.6
connecting to: test
>
>
> db.objects.ensureIndex( {location: "2dsphere"} );
> db.objects.ensureIndex( {"location.coordinates": "2d"} );
>
> db.objects.count()
1000000
> db.objects.find().limit(10)
{ "_id" : ObjectId("53603264fe4dce73dd3d5034"), "location" : { "type" : "Point", "coordinates" : [  -118.50117097515667,  34.0287306670744 ] } }
{ "_id" : ObjectId("53603264fe4dce73de3d5034"), "location" : { "type" : "Point", "coordinates" : [  -118.50275264035425,  34.024570781782444 ] } }
{ "_id" : ObjectId("53603264fe4dce73df3d5034"), "location" : { "type" : "Point", "coordinates" : [  -118.49561566328414,  34.01364954589843 ] } }
{ "_id" : ObjectId("53603264fe4dce73e03d5034"), "location" : { "type" : "Point", "coordinates" : [  -118.49369748842913,  34.02139252927903 ] } }
{ "_id" : ObjectId("53603264fe4dce73df3d5035"), "location" : { "type" : "Point", "coordinates" : [  -118.49891705125842,  34.022663936252435 ] } }
{ "_id" : ObjectId("53603264fe4dce73df3d5036"), "location" : { "type" : "Point", "coordinates" : [  -118.50388040516694,  34.022565334932885 ] } }
{ "_id" : ObjectId("53603264fe4dce73df3d5037"), "location" : { "type" : "Point", "coordinates" : [  -118.49725202335392,  34.023129221606126 ] } }
{ "_id" : ObjectId("53603264fe4dce73df3d5038"), "location" : { "type" : "Point", "coordinates" : [  -118.49577606424951,  34.023187411291026 ] } }
{ "_id" : ObjectId("53603264fe4dce73df3d5039"), "location" : { "type" : "Point", "coordinates" : [  -118.49541242379013,  34.01807833239272 ] } }
{ "_id" : ObjectId("53603264fe4dce73df3d503a"), "location" : { "type" : "Point", "coordinates" : [  -118.49369484599256,  34.01034278843283 ] } }
> var t = new ISODate(); printjson( db.objects.find({ "location": { "$near": { "$geometry": { "type": "Point", "coordinates": [-118.491356, 34.02444] }, "$maxDistance": 10000 } } }).limit(1000).explain() ); print(new ISODate() - t);
{
    "cursor" : "S2NearCursor",
    "isMultiKey" : true,
    "n" : 1000,
    "nscannedObjects" : 1000,
    "nscanned" : 7996423,
    "nscannedObjectsAllPlans" : 1000,
    "nscannedAllPlans" : 7996423,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 7,
    "nChunkSkips" : 0,
    "millis" : 41190,
    "indexBounds" : {
 
    },
    "nscanned" : 7996423,
    "matchTested" : NumberLong(7996423),
    "geoMatchTested" : NumberLong(7996423),
    "numShells" : NumberLong(1),
    "keyGeoSkip" : NumberLong(0),
    "returnSkip" : NumberLong(3577),
    "btreeDups" : NumberLong(0),
    "inAnnulusTested" : NumberLong(7996423),
    "server" : "enter.local:27017"
}
41595
 
> var t = new ISODate(); printjson( db.objects.find({ "location.coordinates": { "$near": [-118.491356, 34.02444], "$maxDistance": 0.00156 } }).limit(1000).explain() ); print(new ISODate() - t);
{
    "cursor" : "GeoSearchCursor",
    "isMultiKey" : false,
    "n" : 1000,
    "nscannedObjects" : 1000,
    "nscanned" : 1000,
    "nscannedObjectsAllPlans" : 1000,
    "nscannedAllPlans" : 1000,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
 
    },
    "server" : "enter.local:27017"
}
12

However, when I changed the data distribution and spread the points further apart, for example by increasing the window to +/- 0.5 (instead of 0.01), and repeated the queries, both ran really fast and there was almost no difference between the two indexes.
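
The exact generation script is not part of this ticket; a minimal sketch along those lines, with the spread as the only variable, would be:

// spread = 0.01 reproduces the slow case; spread = 0.5 makes both indexes fast.
var spread = 0.01;
for (var i = 0; i < 1000000; i++) {
	db.objects.insert({
		location: {
			type: "Point",
			coordinates: [
				-118.50 + (Math.random() * 2 - 1) * spread,
				34.02 + (Math.random() * 2 - 1) * spread
			]
		}
	});
}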

Does your dataset contain a large number of points that are really close to each other?

Another thing I want to point out: the $maxDistance value is treated differently for the different indexes / coordinate formats. For the 2d index and the legacy coordinate pair, $maxDistance is measured in radians, not meters, so you'd have to change that value for your tests. However, I adjusted for that, and even removed it entirely, and still saw the discrepancy in running time.
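
For example, converting the 10 km from the original query into radians for the 2d form (the Earth radius value below is an approximation):

// meters-to-radians conversion for the legacy 2d $maxDistance
var earthRadiusMeters = 6378137;                     // approximate equatorial radius
var maxDistanceRadians = 10000 / earthRadiusMeters;  // ~0.00157, i.e. roughly the 0.00156 used above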

I'm going to follow up with one of our Geo engineers to find out the reason for this behavior. In the meantime, if you can share any extra information about your data set or its distribution, that would be very helpful.

Regards,
Thomas

Comment by Abraham Lopez [ 28/Apr/14 ]

Hi Thomas,

Unfortunately, that's neither the real cause nor the solution.

If you remove the explain() from the queries I included in my examples, you'll notice the MongoDB console takes exactly as long to respond as explain() reports, so the queries are indeed running slowly. And even if you run them with explain(), they are just as slow as without it.

I actually found this issue while developing a Node.js API that connects to MongoDB (using the native driver with MongoJS, no Mongoose) and noticed that the API was ridiculously slow to respond. So I ran the query directly in the MongoDB console and confirmed MongoDB was the bottleneck.

Can you try running the queries I mentioned but without the explain() so you can see what I mean?

This is a very weird issue, which I believe should be fixed, as I'm having to use 2D indexes rather than the 2DSphere ones.

Comment by Thomas Rueckstiess [ 25/Apr/14 ]

Hi Abraham,

In 2.4.x, the explain output of GeoSearchCursors did not return correct results (see SERVER-12231), in particular for the "millis" and "nscanned" fields. To get correct stats for the query, you can run the geoNear command instead; for example, for the case of a 2d index:

db.runCommand({geoNear: "objects", near: [-118.491356, 34.02444] })['stats']

You should see that it had to scan the same large number of matches as your 2dsphere query, and that it took longer than 0ms. You can also wrap the query for the 2d index in

var t = new ISODate(); printjson(db.objects.find( {location: {$near: [-118.491356, 34.02444] } }).explain()); print(new ISODate() - t);

which will measure and print the wall-clock time the query took, in milliseconds.

This was a display issue and was fixed for version 2.6.0.

Regards,
Thomas
