Core Server / SERVER-39238

Allow foreign $lookup pipeline split at DocumentSourceSequentialDocumentCache

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Query Optimization

      As of SERVER-32308, it is possible for a $lookup sub-pipeline to run remotely, either in its entirety or split into local and remote parts. For a sub-pipeline with a non-correlated prefix, we currently cache that prefix only if the pipeline is split and the cache lives on the local (merging) part of the pipeline.
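
      For reference, a minimal sketch of such a pipeline (the orders/inventory collections, field names, and let variable are hypothetical): the prefix stages reference nothing from the local document, so their output is identical for every input and is what DocumentSourceSequentialDocumentCache caches.

          db.orders.aggregate([
            { $lookup: {
                from: "inventory",
                let: { orderSku: "$sku" },
                pipeline: [
                  // Non-correlated prefix: neither stage references a
                  // "let" variable, so the output is identical for every
                  // local document and can be cached.
                  { $match: { status: "active" } },
                  { $project: { _id: 0, sku: 1, qty: 1 } },
                  // Correlated suffix: references $$orderSku, so it must
                  // re-run per local document and cannot be cached.
                  { $match: { $expr: { $eq: ["$sku", "$$orderSku"] } } }
                ],
                as: "stock"
            } }
          ])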

      We could improve caching of the non-correlated prefix by splitting at the DocumentSourceSequentialDocumentCache stage in cases where we would otherwise not split, or would split before this stage. There are two options for this:

      The first is a limited approach that splits only when the DocumentSourceSequentialDocumentCache stage can cache the entire sub-pipeline, which in that case is non-correlated by definition. Here, attempting to cache incurs no additional network traffic, and may reduce traffic if the foreign data set is smaller than the maximum cache size of 100MB.
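
      Under this first option, the split would apply only to sub-pipelines like the following sketch (hypothetical names), where no stage references a "let" variable or the local document, so the cache stage at the end of the sub-pipeline could hold the complete foreign result set:

          db.orders.aggregate([
            { $lookup: {
                from: "config",
                // No "let" variables and no references to the local
                // document: the entire sub-pipeline is non-correlated,
                // so its full output is cacheable.
                pipeline: [
                  { $match: { enabled: true } },
                  { $project: { _id: 0, key: 1, value: 1 } }
                ],
                as: "settings"
            } }
          ])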

      The second option would be a more generalized approach where we always split at the DocumentSourceSequentialDocumentCache stage whenever there is a non-correlated pipeline prefix that would not otherwise be part of the local pipeline. In this case, there are a few things to consider:

      1. The foreign pipeline may have a non-correlated prefix followed by a correlated stage that significantly reduces the size of the result set. Splitting at the cache stage could then send a large volume of data across shards, compared to running the entire pipeline remotely (once per local document) and sending a small volume of data (see the sketch after this list).
      2. This is mitigated by the fact that we would retrieve the foreign data only for the first local document and serve every subsequent document from the cache, provided we do not exceed the cache limit. If the 100MB cache limit were exceeded, we could rewrite the pipeline to execute entirely on the remote shard: abort any foreign pipeline execution once it reaches 100MB (to prevent sending an unbounded amount of data across the wire) and run the entire sub-pipeline remotely.
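
      To make the first consideration concrete, here is a hedged sketch (hypothetical names) of a foreign pipeline whose non-correlated prefix is far less selective than its correlated suffix. Splitting at the cache stage would ship the prefix output across shards for the first local document, whereas fully-remote execution would return only the few documents that survive the correlated $match:

          db.orders.aggregate([
            { $lookup: {
                from: "inventory",  // large, potentially sharded
                let: { orderSku: "$sku" },
                pipeline: [
                  // Non-correlated prefix: matches nearly everything, so
                  // splitting here streams most of the collection to the
                  // merging node (until the 100MB cache limit triggers
                  // the proposed fall-back to fully-remote execution).
                  { $match: { status: { $ne: "deleted" } } },
                  // Correlated suffix: reduces the result to a handful
                  // of documents per order.
                  { $match: { $expr: { $eq: ["$sku", "$$orderSku"] } } }
                ],
                as: "stock"
            } }
          ])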

            Assignee:
            Backlog - Query Optimization
            Reporter:
            James Wahlin (james.wahlin@mongodb.com)
            Votes:
            0
            Watchers:
            2
