Type: Improvement
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Aggregation Framework
Labels: None
Assigned Teams: Query Optimization
Case:
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Issue Status as of Mar 21, 2019

Summary

The $lookup aggregation stage supports joins between unsharded collections, or from a sharded collection to an unsharded one. It does not allow the "from" collection to be sharded. We understand this is a painful and unfortunate limit on the capabilities of the query language. We strive to ensure that the distribution of data does not affect the experience with the database, but we are unable to implement this improvement to our satisfaction at this time. To implement this feature in a way that delivers value, we would need to either (a) substantially improve the query planner's ability to choose the best cluster-wide plan for join-style queries like those involving $lookup stages or (b) improve our ability to limit resource consumption in a sharded environment. Without one of those, we would have to implement the feature in a way that guarantees poor performance as the data size scales up.

In More Detail

After partially implementing this feature, the query team found that our infrastructure is unable to choose a good execution plan for a query where the foreign collection of a $lookup is sharded. Because the current system lacks any way to predict how much matching data each shard will contribute, we would have to guess at the best execution plan. Such heuristics would often choose a plan that shuffles a lot of data around the cluster and degrades performance for other clients. Moreover, more complex or even malicious queries involving many $lookups or deeply nested $lookups could induce enough load to bring the cluster to a halt. For example, imagine an aggregation like the following:

db.sharded.aggregate([
  {$lookup: {
    from: 'sharded',
    pipeline: [
      {$lookup: {
        from: 'sharded',
        pipeline: [
          {$lookup: {
            from: 'sharded',
            pipeline: …
          }}]
      }}]
  }}
])

One correct implementation would be to have a single process (maybe a mongos) perform the entire pipeline, pulling results from each shard as it needs them. Such an implementation would clearly scale very poorly and induce many unnecessary network round trips. An implementation that instead sends the query to execute in parallel on each shard might scale better. But such a query could then exponentially increase the number of connections across the cluster: each shard sends a sub-pipeline to each other shard, each of those sub-pipelines sends another sub-pipeline to each other shard, and so on. This is obviously a contrived example, but even relatively simple-looking queries can consume a lot of the cluster's resources in short order in this way.
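To make the growth described above concrete, here is a rough back-of-the-envelope sketch. It is not how the server behaves. It assumes a deliberately simplified model in which every active pipeline dispatches exactly one nested sub-pipeline to each other shard; the function name `countSubPipelines` and its parameters are hypothetical.

```javascript
// Simplified model (an assumption for illustration, not server behavior):
// every active pipeline dispatches one nested sub-pipeline to each *other*
// shard, once per level of $lookup nesting.
function countSubPipelines(numShards, nestingDepth) {
  let active = numShards; // level 0: the outer pipeline runs on every shard
  let total = active;
  for (let level = 1; level <= nestingDepth; level++) {
    active *= (numShards - 1); // fan out to every other shard
    total += active;
  }
  return total;
}

// Three nested $lookups, as in the aggregation above.
console.log(countSubPipelines(3, 3));  // 45
console.log(countSubPipelines(10, 3)); // 8200
```

Even under this optimistic model, which ignores per-document dispatch entirely, the count grows as roughly (number of shards) to the power of (nesting depth + 1).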

After exposing such complexities in the design, the query team decided that we will need to expand our distributed planning and execution infrastructure to implement this feature well. We understand this is a very desirable feature and plan to work towards it, but have no specific target date or release at this time.

Known Workarounds

The source collection of an aggregation is allowed to be sharded, even if there's a $lookup to an unsharded namespace. So if, for example, you wanted to write

db.unsharded.aggregate([{$lookup: {from: 'sharded', localField: 'unshardedId', foreignField: 'shardedId', as: 'x'}}])

you could instead write something more like

db.sharded.aggregate([{$lookup: {from: 'unsharded', localField: 'shardedId', foreignField: 'unshardedId', as: 'x'}}])

Note that the results of the second form are documents from 'sharded', each carrying its matching 'unsharded' documents in 'x', rather than the other way around.

As always, the client can perform the lookups itself to get similar functionality at a higher performance cost.

In certain cases where querying via a $lookup is common, using a different schema to model the relationship between documents may improve performance and remove the need for a $lookup. See our documentation about data modeling for some suggestions.
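The client-side lookup mentioned in the workarounds above could be sketched roughly as follows. This is a minimal illustration, not an official recipe: the helper `clientSideLookup` and its `fetchForeign` callback are hypothetical names, join keys are assumed to be primitive values (strings or numbers, not ObjectIds), and it only approximates the equality-match form of $lookup.

```javascript
// Hypothetical helper: join localDocs against foreign documents on the client.
// fetchForeign(keys) must return an array of foreign documents whose
// foreignField value is one of `keys`.
function clientSideLookup(localDocs, fetchForeign, localField, foreignField, as) {
  // Collect the distinct join keys so the foreign side is queried only once.
  const keys = [...new Set(localDocs.map(doc => doc[localField]))];
  const foreignDocs = fetchForeign(keys);

  // Group the foreign documents by their join key.
  const byKey = new Map();
  for (const foreign of foreignDocs) {
    const key = foreign[foreignField];
    if (!byKey.has(key)) byKey.set(key, []);
    byKey.get(key).push(foreign);
  }

  // Attach the matching foreign documents (or an empty array) to each local document.
  return localDocs.map(doc => ({...doc, [as]: byKey.get(doc[localField]) || []}));
}

// In the shell, using the collection and field names from the example above:
// const local = db.unsharded.find().toArray();
// const joined = clientSideLookup(
//   local,
//   keys => db.sharded.find({shardedId: {$in: keys}}).toArray(),
//   'unshardedId', 'shardedId', 'x');
```

Batching the keys into a single $in query keeps the number of round trips low, but all of the matching data still has to travel to the client, which is the higher performance cost mentioned above.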

depends on
SERVER-38830 Support sharded $lookup 'let' variable serialization for shard to shard routing (Closed)

has to be done before
SERVER-28705 Add optimization to execute $lookup on local shards when possible (Closed)

is depended on by
SERVER-34935 Support cross-database lookup (Backlog)
SERVER-28705 Add optimization to execute $lookup on local shards when possible (Closed)
SERVER-27496 allow self-$lookup on shard key value equality pre-merge on a sharded collection (Backlog)

is related to
SERVER-27533 Allow "from" collection of $graphLookup to be sharded (Closed)

related to
SERVER-60360 Complete TODO listed in SERVER-29159 (Closed)

Assignee: [DO NOT USE] Backlog - Query Optimization
Reporter: Charlie Swanson
Participants: [DO NOT USE] Backlog - Query Optimization, Abolfazl Ziaratban, Ankur kalavadia, Anoosh C Nayak, Asya Kamsky, Charlie Swanson, Daniel Connelly, Gerry Brady, Katherine Wu, Mahesh Vaghela, Michał Czernecki, Michael Ahlijah, Oliver Weng, Timothy Masters, Victor Gomez
Votes: 60
Watchers: 72

Created: May 12 2017 03:44:37 PM UTC
Updated: Dec 06 2022 04:01:14 AM UTC
Resolved: Sep 30 2021 06:21:09 PM UTC
