[SERVER-29159] Allow "from" collection of $lookup to be sharded Created: 12/May/17  Updated: 06/Dec/22  Resolved: 30/Sep/21

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Charlie Swanson Assignee: Backlog - Query Optimization
Resolution: Duplicate Votes: 60
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-38830 Support sharded $lookup 'let' variabl... Closed
is depended on by SERVER-34935 Support cross-database lookup Backlog
is depended on by SERVER-28705 Add optimization to execute $lookup o... Closed
is depended on by SERVER-27496 allow self-$lookup on shard key value... Backlog
Gantt Dependency
has to be done before SERVER-28705 Add optimization to execute $lookup o... Closed
Related
related to SERVER-60360 Complete TODO listed in SERVER-29159 Closed
is related to SERVER-27533 Allow "from" collection of $graphLook... Closed
Assigned Teams:
Query Optimization
Participants:
Case:

 Description   
Issue Status as of Mar 21, 2019

Summary

The $lookup aggregation stage allows for collection join across unsharded collections or from a sharded collection to an unsharded one. It does not allow for the "from" collection to be sharded. We understand this is a painful and unfortunate limit on the capabilities of the query language. We strive to make it so that the distribution of data does not impact the experience with the database, but are unable to implement this improvement to our satisfaction at this time. In order to implement this feature in a way that delivers value, we would need to either (a) substantially improve the query planner's ability to provide the best cluster-wide plan for join-style queries like those involving $lookup stages or (b) improve our ability to limit resource consumption in a sharded environment. Without either of those, we would have to implement the feature in a way that guarantees poor performance as the data size scales up.

In More Detail

After partially-implementing this feature, the query team found that our infrastructure is unable to choose a good execution plan for a query where the foreign collection of a $lookup is sharded. Because the current system lacks any way to predict how much matching data will be contributed from each shard, we must make guesses at the best execution plan. Such heuristics would often choose a plan which would shuffle a lot of data around the cluster and degrade performance for other clients. Moreover, more complex or even malicious queries involving many $lookups or deeply-nested $lookups could induce enough load to bring the cluster to a halt. For example, imagine an aggregation like the following:

db.sharded.aggregate([
  {$lookup: {
    from: 'sharded',
    pipeline: [
      {$lookup: {
        from: 'sharded',
        pipeline: [
          {$lookup: {
            from: 'sharded',
            pipeline: …
          }}]
      }}]
  }}
])

One correct implementation would be to have a single process (maybe a mongos) perform the entire pipeline, pulling results from each shard as it needs them. Such an implementation would clearly scale very poorly, and induce many unnecessary network round-trips. If you instead imagine an implementation which sends the query to execute in parallel on each shard, it might scale up better. But then such a query could exponentially explode the number of connections across the cluster by having each shard send a sub-pipeline to each other shard, then have that sub-pipeline send another sub-pipeline to each other shard, and so on. This is obviously a contrived example, but even relatively simple-looking queries can quickly eat up a lot of the cluster's resources in short order in this way.

After exposing such complexities in the design, the query team decided we will need to expand our distributed planning and execution infrastructure to implement this feature well. We understand this is a very desirable future and plan to work towards it in the future, but have no specific target date or release at this time.

Known Workarounds

  1. The source collection of an aggregation is allowed to be sharded, even if there's a $lookup to an unsharded namespace. So if for example you wanted to write
    db.unsharded.aggregate([{$lookup: {from: 'sharded', localField: 'unshardedId', foreignField: 'shardedId', as: 'x'}}])
    You could instead write something more like
    db.sharded.aggregate([{$lookup: {from: 'unsharded', localField: 'shardedId', foreignField: 'unshardedId', as: 'x'}}])
  2. As always, the client can perform the lookups themselves to get similar functionality at a higher performance cost.
  3. In certain cases where querying via a $lookup is common, using a different schema to model the relationship between documents may improve performance and remove the need for a $lookup. See our documentation about data modeling for some suggestions.


 Comments   
Comment by Katherine Wu (Inactive) [ 30/Sep/21 ]

Closing this ticket, support for sharded "from" collections in $lookup and $graphLookup will be part of the upcoming 5.1 release.

Comment by Oliver Weng [ 19/Jul/21 ]

same here, this is a bottle neck for scaling.

 

We have to migrate all the lookup into an simple query and sometimes, it's just needed to be as part of aggregation stage, in this case, we had to run aggregate on sharded collection and loop up unsharded collection and then group  it out etc, it's adding unecessary complexity. 

 

It will be a life saver if we have this feature.

Comment by Daniel Connelly [ 21/Jun/21 ]

I and my engineering team are willing to put in work to develop this feature (with the hope it works/expedites progress towards SERVER-27533). Please contact me at dconnelly@strongboxdata.com if this offer would be helpful.

Comment by Anoosh C Nayak [ 04/Apr/21 ]

Same here ..waiting for this feature. Scaling becomes difficult by not sharding the collections

Comment by Abolfazl Ziaratban [ 10/Feb/21 ]

this really is a deficiency in MongoDB. Without this feature, the performance will decrease very much in high volumes.

Comment by Asya Kamsky [ 19/Jul/19 ]

victor.gomez@lansweeper.com I'm afraid due to some dependencies this feature is at least a couple of releases away, however depending on how many client ids you have, would it be possible for you to instead split each client into an (unsharded) database? Then you'd be able to use $lookup if necessary.

Comment by Victor Gomez [ 11/Jul/19 ]

we are evaluating mongo atlas shard cluster for our SaaS product version but this feature is blocking the use of Mongo as a core database of the system. Do you have any plan or update about this feature? Our case, we are sharding the data by client id, so mean the lookup data will always be in the same shard, we don't need multishard lookup feature.

 

 

Comment by Mahesh Vaghela [ 07/Jan/19 ]

$lookup is very much required for our application use case also. Could you please update any plan to include this for sharded collection in near future? 

Comment by Gerry Brady [ 26/Oct/18 ]

I would like to add my voice this. $lookup is powerful utility and we very much would like to make it part of the toolset we use. That being said, if it cannot work on sharded collections, it provides little value as a long term solution for us.

It would be a great asset for us to at least know if it is in your plans, and if so approximately when you would expect it to implemented.

Thanks

Comment by Timothy Masters [ 08/Feb/18 ]

This feature is one that needs to be implemented in the next release. $lookup is something we need to utilize and has forced us to unshard some of our collections, forcing us to sacrifice scalability. To be useful in a production environment $lookup needs to support sharded collections. Please give this high priority.

Comment by Michael Ahlijah [ 05/Feb/18 ]

Hoping this is given a high priority in the next release.

I have had to un-shard some of my collections in order to use the $lookup feature in the aggregation pipeline. Obviously we lose the scalability that is a major reason for switching to mongodb. The data is going to grow large very soon so hoping for this to be address soon enough.

Comment by Ankur kalavadia [ 11/Oct/17 ]

$lookup is very powerful feature and it has to support sharded collection.

Any plan to include sharded collection into $lookup ?

Comment by Micha? Czernecki [ 04/Jul/17 ]

$graphLookup in 3.4 (SERVER-27533) and $lookup in 3.2 (SERVER-29159) are great new things in MongoDB, but restriction to non-sharded collections makes it useless in practice.
Add sharded collections support to $graphQuery or at least to $lookup as soon as possible please, it is crucial to make it useful in real production environments.
Best Regards.

Generated at Thu Feb 08 04:20:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.