[SERVER-1982] fully support sharding on geo field Created: 20/Oct/10  Updated: 28/Dec/23

Status: Backlog
Project: Core Server
Component/s: Geo, Sharding
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Mathias Stearn Assignee: Backlog - Query Integration
Resolution: Unresolved Votes: 53
Labels: qi-geo
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Initiative
Related
related to SERVER-926 geo search sharding support Closed
Assigned Teams:
Query Integration
Participants:
Case:

 Description   

Currently "works" by routing geo queries to all shards. Correct results, but inefficient.



 Comments   
Comment by Xiaochen Wu [ 15/Dec/22 ]

Linked to INIT-5 and sent it back to the QE backlog. kateryna.kamenieva@mongodb.com kyle.suarez@mongodb.com please review.

Comment by Michael Kalish [ 28/Jul/14 ]

Any update on when this might make it in? Seems like this would be vital for operating with multiple, geographically distant data centers.
As far as I can tell, the only real alternative is using the "nearest" read preference when reading from secondaries, but this does not seem like a satisfactory solution.

Comment by Ron Waldon [ 20/Mar/13 ]

Here's a scenario:

  • you have a single MongoDB cluster that spans multiple data-centres across the Earth
  • you have application worker instances in each data-centre and geographic DNS routing
  • traffic to/from particular data-centres is expensive (e.g. South Africa) or slow (e.g. Singapore)
  • it is more important to limit network traffic to a particular zone than it is to optimise query time

I realise you could accomplish this with separate MongoDB clusters, but then you have to maintain application-level logic to know which configurations to use and when. But this is a slippery slope: after all, I can use application-level logic to get sharding to work with MySQL.

To me, the philosophy of MongoDB is to let the DB make intelligent choices about data storage and retrieval. This seems like an entirely reasonable request to increase the ability for MongoDB to make intelligent choices.

Comment by Scott Lowe [ 30/Mar/12 ]

I was excited to see the geo indexing capabilities of Mongo, but was disappointed that this has not been properly implemented with regard to sharding. IMO this makes geo data a second-class citizen in Mongo. Are there plans for this to be implemented?

Comment by Leon van Zantvoort [ 01/Mar/12 ]

@Eliot I don't have any performance issues at this moment. I was just checking the (theoretical) scalability properties of geo/sharding in case I do need to scale.

Thanks for the response anyway!

Comment by Eliot Horowitz (Inactive) [ 01/Mar/12 ]

Yes - but it's unclear if that's good or bad in this case.
Have you tried it and seen performance issues?

Comment by Leon van Zantvoort [ 01/Mar/12 ]

You describe the benefits of sharding. I totally agree with you on those positive effects.

In general, my statement boils down to:

"Sharding on key 'A' and querying (solely) on key 'B' is not efficient. It is more efficient than a non-partitioned approach for the reasons you describe above, but the fact remains that all shards are involved in a search if the shard key is not part of the query."

As long as sharding on geo key is not fully supported, it looks like Mongo behaves like this.

Comment by Mathias Stearn [ 29/Feb/12 ]

To be clear, you went from 30% to 20% load. Also, by spreading the data out you are able to keep more data in RAM, which is the #1 determiner of performance. Having all data in RAM is between 1,000 (high-end SSD) and 1,000,000 (spinning disk) times faster, so it matters much more for scalability than reducing the processing that each server has to do.

Comment by Leon van Zantvoort [ 29/Feb/12 ]

Again, taking O(log n) as an example.

1000 documents, 1 shard would be:
Response time: 3
System load: 3

1000 documents equally distributed over 10 shards:
Response time: 2
System load: 20

I know there is more to it, but this is the point I'm trying to make.
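
A minimal sketch of the arithmetic above, assuming the same toy model: each shard's search cost is log10 of the documents it holds, shards work in parallel, and merge cost is ignored. The function name toy_cost is hypothetical and only illustrates the numbers in this comment; it is not a measurement of MongoDB.

```python
import math


def toy_cost(total_docs: int, shards: int) -> tuple[float, float]:
    """Return (response_time, system_load) under the toy O(log n) model."""
    docs_per_shard = total_docs / shards
    # Each shard searches only its own slice; shards are assumed to run in parallel.
    response_time = math.log10(docs_per_shard)
    # Every shard performs that search, so total work grows with the shard count.
    system_load = shards * response_time
    return response_time, system_load


for shards in (1, 10):
    response_time, system_load = toy_cost(1000, shards)
    print(f"{shards} shard(s): response time {response_time:g}, system load {system_load:g}")
```

Under this model the per-query response time falls slowly while the total work rises almost linearly with the number of shards, which is the trade-off being argued here.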

Comment by Leon van Zantvoort [ 29/Feb/12 ]

First of all, thanks for being responsive. This is appreciated!

I'm not sure what the time complexity of geospatial queries is, but for the sake of simplicity say it's O(log n).

If I apply this to your suggestion, we would get the following:

Geo key is used for sharding: O(log n)
Alternative key is used for sharding (key is not part of search query): O(s log(n/s)), where s is the number of shards and n is the total number of documents.

This means that the more shards I add, the quicker each individual shard will find results (the performance gain is sublinear); however, this needs to be multiplied by the number of shards.

So from a response time perspective (single user system, weakest link not taken into account), you are right about performance improvement using multiple shards. But from a scalability perspective, this approach will add more and more load on the system as a whole if more shards are added.

Comment by Eliot Horowitz (Inactive) [ 29/Feb/12 ]

Sort of.
The problem is that if you search on a boundary, you still might be hitting many shards, so there is no guarantee you only hit 1 shard.

If one shard is handling it, then it may have to look at 1000 documents.
If you have 10 shards, then each shard looks at 100, and mongos merges.

Pros and cons, but I would test sharding by a different key as well.
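
The scatter-gather pattern described above can be sketched as follows. This is an illustration under stated assumptions, not mongos code: shards are plain in-memory lists of (x, y) points, distance is planar, and the names shard_nearest and scatter_gather_nearest are hypothetical.

```python
import heapq
import itertools
import math

Point = tuple[float, float]


def shard_nearest(docs: list[Point], center: Point, k: int) -> list[tuple[float, Point]]:
    """One shard's answer: its own k closest points, sorted by distance."""
    scored = [(math.dist(p, center), p) for p in docs]
    return heapq.nsmallest(k, scored)


def scatter_gather_nearest(shards: list[list[Point]], center: Point, k: int) -> list[Point]:
    """Ask every shard for its top k, then merge the sorted partial results."""
    partials = [shard_nearest(docs, center, k) for docs in shards]
    # Each partial list is already sorted by distance, so a k-way merge suffices.
    merged = heapq.merge(*partials)
    return [point for _, point in itertools.islice(merged, k)]


shards = [
    [(0.0, 0.0), (5.0, 5.0)],
    [(1.0, 1.0), (9.0, 9.0)],
    [(2.0, 2.0), (8.0, 8.0)],
]
print(scatter_gather_nearest(shards, center=(0.0, 0.0), k=3))
```

Each shard scans only its own slice, but every shard is contacted for every query, which is the cost the other commenters object to.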

Comment by Leon van Zantvoort [ 29/Feb/12 ]

Hi Eliot,

I would like to use the geo key for sharding, so I can look up documents near a given location efficiently (and in a scalable way).

If I understand correctly, if sharding on geo keys were fully supported, only a single shard would be accessed to get documents near a given location (maybe two shards if the given location is close to a range boundary). For that reason, it would be a waste of resources if other shards were accessed, as they don't have any documents near this location.
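
A rough sketch of the targeted routing described above, under strong simplifying assumptions: the shard key is reduced to one dimension (longitude only), each shard owns one contiguous range, and wrap-around at ±180 is ignored. SPLIT_POINTS and shards_for_range are hypothetical names; this does not describe how MongoDB routes queries.

```python
import bisect

# Hypothetical split points: shard 0 owns keys below -90, shard 1 owns [-90, 0),
# shard 2 owns [0, 90), and shard 3 owns keys from 90 upward.
SPLIT_POINTS = [-90.0, 0.0, 90.0]


def shards_for_range(lo: float, hi: float, split_points: list[float] = SPLIT_POINTS) -> list[int]:
    """Return the indexes of the shards whose key range overlaps [lo, hi]."""
    first = bisect.bisect_right(split_points, lo)
    last = bisect.bisect_right(split_points, hi)
    return list(range(first, last + 1))


# A search around longitude 45 with a radius of 10 stays inside one range.
print(shards_for_range(35.0, 55.0))
# A search around longitude -2 with a radius of 5 straddles the boundary at 0.
print(shards_for_range(-7.0, 3.0))
```

A search window that sits inside one range targets one shard, and a window that crosses a split point targets two, matching the "maybe two shards" case above.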

Thanks,
Leon

Comment by Eliot Horowitz (Inactive) [ 29/Feb/12 ]

A call to all shards scales pretty well, as each shard is doing 1/N the work.
So for large data sets, it can actually be considerably faster.

Comment by Ignacio Lago [ 29/Feb/12 ]

And it's critical since it is the common case scenario for a MongoDB setup.

As an example: Foursquare uses MongoDB mainly because of sharding and geospatial indexes. (#ref http://www.10gen.com/customers/foursquare)

It's ironic that both features aren't compatible yet. I mean: efficient and useful, not a call to all shards.

Comment by Leon van Zantvoort [ 29/Feb/12 ]

Hi Mathias,

Since this case was created over 16 months ago, "Planning Bucket A" is not telling me if this will be picked up within the coming 16 months or later...

Leon

Comment by Mathias Stearn [ 29/Feb/12 ]

That is what this case is for.

Comment by Leon van Zantvoort [ 28/Feb/12 ]

(Sorry for duplicating this comment, but I think this issue is better suited for it...)

Are there any plans to support sharding on geo key?

As long as this isn't supported, geospatial searching doesn't scale with Mongo, or am I missing something here?

Generated at Thu Feb 08 02:58:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.