[SERVER-39238] Allow foreign $lookup pipeline split at DocumentSourceSequentialDocumentCache Created: 28/Jan/19  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: James Wahlin Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-32308 Add the ability for a $lookup stage t... Closed
Assigned Teams:
Query Optimization
Participants:

 Description   

As of SERVER-32308, it is possible for a $lookup sub-pipeline to be run remotely, either in its entirety or split into local and remote parts. In the case of a pipeline with a non-correlated prefix, we currently only cache that prefix if the pipeline is split and the cache lives on the local (merging) part of the pipeline.

We could optimize caching of the non-correlated pipeline prefix by splitting at the DocumentSourceSequentialDocumentCache stage in cases where we would otherwise not split, or would split prior to this stage. There are two options for this:

First, a limited approach that would only split if the DocumentSourceSequentialDocumentCache stage could cache the entire sub-pipeline, which is non-correlated by definition. In this case, attempting to cache will not result in additional network traffic, and may reduce traffic if the foreign data size is smaller than the maximum cache size of 100MB.
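The limited approach's split decision can be sketched as follows. This is a toy stand-in for the server's correlation analysis, assuming (for illustration only) that a stage is correlated exactly when its serialized form references a "let" variable via $$:

```python
import json

def noncorrelated_prefix_len(pipeline, let_vars):
    """Number of leading stages that reference none of the $lookup 'let'
    variables (a simplified stand-in for real correlation analysis)."""
    n = 0
    for stage in pipeline:
        text = json.dumps(stage)
        if any(f"$${var}" in text for var in let_vars):
            break
        n += 1
    return n

def limited_split_ok(pipeline, let_vars):
    """Limited approach: split at the cache stage only when the whole
    sub-pipeline is non-correlated, i.e. fully cacheable."""
    return noncorrelated_prefix_len(pipeline, let_vars) == len(pipeline)
```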

The second option would be a more generalized approach where we always split at the DocumentSourceSequentialDocumentCache stage when there is a non-correlated pipeline prefix that would not otherwise be part of the local pipeline. In this case, there are a few things to consider:

  1. It is possible that the foreign pipeline has a non-correlated prefix followed by a correlated stage that significantly reduces the size of the result set. In this case we could be sending a large volume of data across shards, as compared to running the entire pipeline remotely (for every local document) and sending a small volume of data.
  2. The above is mitigated by the fact that we would only retrieve for the first local document and would then use the cache from that point on, provided we don't exceed the cache limit. If the 100MB cache limit was exceeded, we could rewrite the pipeline to execute entirely on the remote shard. We could abort any foreign pipeline execution if it reached 100MB (to prevent sending an unbounded amount of data across the wire) and run the entire sub-pipeline remotely.
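The overflow fallback in point 2 can be sketched as a toy cache model (the class name echoes the server stage, but the behavior below is only the abort-on-overflow idea from this ticket, not the actual implementation):

```python
class SequentialDocumentCacheModel:
    """Toy model of the cache-overflow fallback: buffer documents until
    the size limit is hit, then abandon the cache and signal that the
    sub-pipeline should be rewritten to run entirely remotely."""

    def __init__(self, max_bytes=100 * 1024 * 1024):
        self.max_bytes = max_bytes
        self.size = 0
        self.docs = []
        self.abandoned = False

    def add(self, doc, doc_size):
        if self.abandoned:
            return
        if self.size + doc_size > self.max_bytes:
            # Limit exceeded: stop caching and discard what we buffered;
            # the caller should re-run the whole sub-pipeline remotely.
            self.docs.clear()
            self.size = 0
            self.abandoned = True
        else:
            self.docs.append(doc)
            self.size += doc_size
```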

Generated at Thu Feb 08 04:51:27 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.