Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-29159

Allow "from" collection of $lookup to be sharded

    • Type: Icon: Improvement Improvement
    • Resolution: Duplicate
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Aggregation Framework
    • Labels:
      None
    • Query Optimization

      Issue Status as of Mar 21, 2019

      Summary

      The $lookup aggregation stage allows for collection join across unsharded collections or from a sharded collection to an unsharded one. It does not allow for the "from" collection to be sharded. We understand this is a painful and unfortunate limit on the capabilities of the query language. We strive to make it so that the distribution of data does not impact the experience with the database, but are unable to implement this improvement to our satisfaction at this time. In order to implement this feature in a way that delivers value, we would need to either (a) substantially improve the query planner's ability to provide the best cluster-wide plan for join-style queries like those involving $lookup stages or (b) improve our ability to limit resource consumption in a sharded environment. Without either of those, we would have to implement the feature in a way that guarantees poor performance as the data size scales up.

      In More Detail

      After partially-implementing this feature, the query team found that our infrastructure is unable to choose a good execution plan for a query where the foreign collection of a $lookup is sharded. Because the current system lacks any way to predict how much matching data will be contributed from each shard, we must make guesses at the best execution plan. Such heuristics would often choose a plan which would shuffle a lot of data around the cluster and degrade performance for other clients. Moreover, more complex or even malicious queries involving many $lookups or deeply-nested $lookups could induce enough load to bring the cluster to a halt. For example, imagine an aggregation like the following:

      db.sharded.aggregate([
        {$lookup: {
          from: 'sharded',
          pipeline: [
            {$lookup: {
              from: 'sharded',
              pipeline: [
                {$lookup: {
                  from: 'sharded',
                  pipeline: …
                }}]
            }}]
        }}
      ])
      

      One correct implementation would be to have a single process (maybe a mongos) perform the entire pipeline, pulling results from each shard as it needs them. Such an implementation would clearly scale very poorly, and induce many unnecessary network round-trips. If you instead imagine an implementation which sends the query to execute in parallel on each shard, it might scale up better. But then such a query could exponentially explode the number of connections across the cluster by having each shard send a sub-pipeline to each other shard, then have that sub-pipeline send another sub-pipeline to each other shard, and so on. This is obviously a contrived example, but even relatively simple-looking queries can quickly eat up a lot of the cluster's resources in short order in this way.

      After exposing such complexities in the design, the query team decided we will need to expand our distributed planning and execution infrastructure to implement this feature well. We understand this is a very desirable future and plan to work towards it in the future, but have no specific target date or release at this time.

      Known Workarounds

      1. The source collection of an aggregation is allowed to be sharded, even if there's a $lookup to an unsharded namespace. So if for example you wanted to write
        db.unsharded.aggregate([{$lookup: {from: 'sharded', localField: 'unshardedId', foreignField: 'shardedId', as: 'x'}}])
        You could instead write something more like
        db.sharded.aggregate([{$lookup: {from: 'unsharded', localField: 'shardedId', foreignField: 'unshardedId', as: 'x'}}])
      2. As always, the client can perform the lookups themselves to get similar functionality at a higher performance cost.
      3. In certain cases where querying via a $lookup is common, using a different schema to model the relationship between documents may improve performance and remove the need for a $lookup. See our documentation about data modeling for some suggestions.

            Votes:
            60 Vote for this issue
            Watchers:
            72 Start watching this issue

              Created:
              Updated:
              Resolved: