[SERVER-33588] Support global secondary indexes in sharded clusters Created: 01/Mar/18  Updated: 12/Dec/23

Status: Backlog
Project: Core Server
Component/s: Index Maintenance, Sharding
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: David Bartley Assignee: Backlog - Cluster Scalability
Resolution: Unresolved Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Cluster Scalability
Participants:

 Description   

If you have a collection containing "foo" and "bar", it could be the case that you need to make queries against both "foo" and "bar". You could configure a sharded collection to range shard over "foo", which would make any queries against "foo" only hit 1 replset shard, but any query against "bar" would require fanning out to every replset, which can adversely impact load/latency/availability if you have many replset shards.

An approach to solving this would be to have global secondary indexes (GSI; c.f. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html), where you store the keys you want to index ("foo" and "bar" here), along with the underlying _id of the document, or possibly the entire document. These collections would be range sharded (by "foo"+"bar"), so that a query to a GSI would tend to only hit 1 replset initially. If you chose to store the entire document in the GSI, you'd be done. If you instead stored just _id, you'd need to query the underlying collection. If you end up returning 100's of documents you probably end up hitting all replset shards as before, but for the nReturned=0 or nReturned=1 case you'd only hit a single additional replset.

Initially, you'd probably want GSIs to be eventually consistent, though with MongoDB's new transaction support you could conceivably make them more strongly consistent.



 Comments   
Comment by Andy Schwerin [ 28/Jan/19 ]

It's still on the backlog waiting for scheduling. I'm pretty interested in the project, but I can't promise we'll have a chance to do it in 2019. It's definitely not going to make 4.2.

Comment by David Bartley [ 26/Jan/19 ]

Any update on this?

Comment by Ramon Fernandez Marina [ 05/Mar/18 ]

Thanks for your report bartle. Sending it to the sharding team for consideration.

Comment by David Bartley [ 01/Mar/18 ]

Sorry, you'd want to range shard by "bar" in your GSI so that you can efficiently service queries on "bar" (you'd range shard the underlying collection by "foo"). Or, you could have two GSIs, on "foo", and on "bar", and then hash shard your underlying collection by _id, if you tended to make a lot of _id lookups.

Generated at Thu Feb 08 04:33:55 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.