[SERVER-9064] Count of Geo Queries of Millions of Documents Extremely Slow Created: 21/Mar/13 Updated: 10/Dec/14 Resolved: 02/May/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Geo |
| Affects Version/s: | 2.2.0, 2.4.0 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Critical - P2 |
| Reporter: | Matthew Cross | Assignee: | hari.khalsa@10gen.com |
| Resolution: | Done | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Linux (EC2) |
| Participants: |
| Description |
|
I have a 5M-record dataset with geo locations that I search to find points within a selected area. The find operation returns results quickly, but I also need a count of the points in the area. If I select a city or an entire state on this data, the count query takes 45 seconds or more to execute. There is a similar problem when sorting the results of a find (NOT a count in this case), regardless of indexing. This thread has additional info |
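As a rough sketch of the shape of the problem (the collection, field name, and polygon below are illustrative, taken from queries reported later in this ticket):

    // Assumed 2dsphere index on the location field:
    db.contact.ensureIndex({ "address.gisLocation": "2dsphere" });

    // An illustrative search area (GeoJSON rings must repeat the first
    // point as the last):
    var area = {
        $geoWithin: {
            $geometry: {
                type: "Polygon",
                coordinates: [[
                    [-88.35, 30.98], [-81.32, 30.69], [-79.43, 25.38],
                    [-81.01, 24.49], [-88.35, 30.98]
                ]]
            }
        }
    };

    db.contact.find({ "address.gisLocation": area });   // returns quickly
    db.contact.count({ "address.gisLocation": area });  // 45+ seconds on ~5M documents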
| Comments |
| Comment by Matthew Cross [ 08/May/13 ] |
|
The problem here is that neither the 2d nor the 2dsphere index lets me count points within a polygon in a way that works for my app. The data behind these queries backs most of my webapp, and due to mongo's locking characteristics a few large geo queries will bring my site down. Even for moderate numbers of countable points within a polygon, query times run up to 10 seconds, and enough of those queries is game over for my app.

I've tried enabling slaveOK and might try sharding, but in the end I'm probably going to have to use a different store because of the 10x better performance. I would really like to see this be a priority for you guys, because this single factor is going to prevent me from using mongo at all, and I can't be alone on that count. I'm open to other suggestions for how I might manage my data to facilitate these queries. I can't pre-compute the counts because they are based on live weather patterns overlaid upon location data that can also change.

Postgres or ElasticSearch would be a nice workaround, but the nature of our queries is such that I can't use one of these alternatives without using it for everything. I have even considered doing a MongoDB follow-on $in query with the millions of IDs that come out of Postgres or ES, but that just sounds insane.
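For reference, the follow-on approach mentioned above would look roughly like this; externalGeoStore.idsInPolygon is a hypothetical stand-in for the Postgres/ES query:

    // IDs of all points inside the polygon, fetched from Postgres/ES
    // (externalGeoStore.idsInPolygon is a hypothetical helper):
    var ids = externalGeoStore.idsInPolygon(polygon);

    // Counting the matching documents then requires shipping every ID back
    // to MongoDB, so the query document itself holds millions of values:
    db.contact.count({ _id: { $in: ids } });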
| Comment by hari.khalsa@10gen.com [ 02/May/13 ] |
|
Hi! Sorry for not getting back to this sooner. The 2dsphere index isn't designed for highly efficient counting; without an index format change or a redesign, I don't think it's going to perform well for counting large numbers of documents. |
| Comment by Matthew Cross [ 04/Apr/13 ] |
|
I will revisit my results from the user group thread, but I was mostly referring to the much faster query using $within with $polygon and the standard "2d" index. |
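For clarity, that variant would look something like the following (index type and field name are assumptions based on the rest of this ticket):

    db.contact.ensureIndex({ "address.gisLocation": "2d" });

    // Legacy 2d form; the $polygon ring is an open list of [lng, lat] points:
    db.contact.find({
        "address.gisLocation": {
            $within: { $polygon: [ [-88.35, 30.98], [-81.32, 30.69], [-79.43, 25.38] ] }
        }
    }).count();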
| Comment by Matthew Cross [ 04/Apr/13 ] |
|
Also note that I cannot hint the query to use the index:

    db.contact.find(/.../).hint({ "address.gisLocation": 1 });
    db.contact.getIndexes()
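A possible explanation, offered as an assumption rather than anything confirmed here: hint() must match an existing index, so for a 2dsphere index the hint needs the index's actual spec or name, e.g.:

    // Hint by the index specification used when it was created...
    db.contact.find({ "address.gisLocation": { $geoWithin:
        { $centerSphere: [ [-82.5, 28.0], 0.01 ] } } })
        .hint({ "address.gisLocation": "2dsphere" });

    // ...or by the index name (default naming shown; confirm with getIndexes()):
    db.contact.find({ "address.gisLocation": { $geoWithin:
        { $centerSphere: [ [-82.5, 28.0], 0.01 ] } } })
        .hint("address.gisLocation_2dsphere");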
| Comment by hari.khalsa@10gen.com [ 04/Apr/13 ] |
|
Hi. Here is the original query you reported:

    > db.contact2.find({geo_2: { $within:

If you compare this to one of your recently-reported queries, you'll see the run time actually got smaller: millis was 83783, now is 66435. So I'm not sure how your results are worse than before. Can you elaborate? PS: any time you see 'S2Cursor' in the explain output, the query is using the 2dsphere index.
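A quick way to check this from the shell, sketched against the 2.4-era explain() output (the query shown is illustrative):

    // "cursor" names the access method: "S2Cursor" means the 2dsphere index
    // was used, "BasicCursor" means a full collection scan.
    db.contact.find({
        "address.gisLocation": {
            $geoWithin: { $centerSphere: [ [-82.5, 28.0], 0.01 ] }
        }
    }).explain().cursor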
| Comment by Matthew Cross [ 04/Apr/13 ] |
|
Tried again, first dropping all indexes and adding a single 2dsphere index on address.gisLocation. Also changed my query to $geoWithin. Results were marginally slower:

    real 1m18.714s
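The re-indexing steps described here would look roughly like this in the shell:

    db.contact.dropIndexes();
    db.contact.ensureIndex({ "address.gisLocation": "2dsphere" });
    db.contact.getIndexes();  // should now list only _id plus the 2dsphere index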
| Comment by Matthew Cross [ 04/Apr/13 ] |
|
I downloaded a 2.5 nightly build which has this change listed in the release notes. My results are worse than before. Should I be seeing an improvement?

    db.contact.find({"address.gisLocation": { $within:

Time to run:

    real 1m3.799s

Here is the explain (note I don't know how to explain a count). Looks like it didn't use the 2dsphere index at all. The same search with the legacy $polygon form:

    db.contact.find({"address.gisLocation" : {"$within" : { "$polygon" : [
        [-88.352051, 30.977609], [-81.320801, 30.694612],
        [-79.431152, 25.383735], [-81.013184, 24.487149],
        [-82.990723, 27.741885], [-83.540039, 29.401320],
        [-88.308105, 30.126124]
    ] } } })

Time to run:

    real 0m36.326s

... and here is the explain for this query.
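For comparison, a sketch of the GeoJSON form this search would need in order to use the 2dsphere index; note that unlike the legacy $polygon list above, a GeoJSON ring must be closed by repeating the first point as the last:

    db.contact.count({
        "address.gisLocation": {
            $geoWithin: {
                $geometry: {
                    type: "Polygon",
                    coordinates: [[
                        [-88.352051, 30.977609], [-81.320801, 30.694612],
                        [-79.431152, 25.383735], [-81.013184, 24.487149],
                        [-82.990723, 27.741885], [-83.540039, 29.401320],
                        [-88.308105, 30.126124], [-88.352051, 30.977609]
                    ]]
                }
            }
        }
    });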
| Comment by auto [ 03/Apr/13 ] |
|
Author: Hari Khalsa <hkhalsa@10gen.com>, 2013-04-01T16:03:14Z. Message: |
| Comment by Matthew Cross [ 22/Mar/13 ] |
|
This information is also in the link above, but just as a point of comparison: PostgreSQL with PostGIS returns results in approximately 3-5s and ElasticSearch in approximately 2-4s, so with the right index this should be a fast operation. |