[SERVER-58131] Push down $lookup through $_internalUnpackBucket Created: 28/Jun/21  Updated: 27/Dec/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: David Percy Assignee: Backlog - Query Integration
Resolution: Unresolved Votes: 0
Labels: qi-timeseries
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-42337 Use better join algorithms for $lookup Backlog
Assigned Teams:
Query Integration
Participants:

 Description   

With time-series data, you might want to store some metadata in separate collection. For example if each event is associated with a site (a location), you could have just the site name in each event, and the site details in separate collection:

> db.events.find()
{ ... meta: { site: 'A' }, ... }
{ ... meta: { site: 'B' }, ... }
...
 
> db.sites.find()
{ _id: 'A', ... }
{ _id: 'B', ... }

You would query it by doing a $lookup to pull the details into each event:

db.events.aggregate([
  {$lookup: {
    from: 'sites', 
    localField: 'meta.site', 
    foreignField: '_id', 
    as: 'meta.site_details',
  }}
])

Since this $lookup only reads and writes metadata fields, we should be able to execute it once per bucket, before unpacking:

db.system.buckets.events.aggregate([
  {$lookup: {
    from: 'sites', 
    localField: 'meta.site', 
    foreignField: '_id', 
    as: 'meta.site_details',
  }},
  {$_internalUnpackBucket: {metaField: 'meta'}},
])



 Comments   
Comment by David Percy [ 28/Jun/21 ]

We could do the same optimization with the let/pipeline syntax:

db.events.aggregate([
  {$lookup: {
    from: 'sites', 
    let: {site_name: "$site"},
    pipeline: [{$match: {$expr: {$eq: ["$_id", "$$site_name"]}}}],
    as: 'meta.site_details',
  }}
])

The let and as only reference metadata fields, and the pipeline has no side effects ($rand or $sample: SERVER-49024).

Comment by David Percy [ 28/Jun/21 ]

You might also hope that this would be pushed down:

db.events.aggregate([
  {$lookup: {
    from: 'sites', 
    localField: 'meta.site', 
    foreignField: '_id', 
    as: 'site_details',
  }}
])

The only difference is as: 'site_details' instead of as: 'meta.site_details'. The problem is we need to put the result of the lookup somewhere in 'meta', because that's the format $_internalUnpackBucket understands. We could make up a temporary name:

db.system.buckets.events.aggregate([
  {$lookup: {
    from: 'sites', 
    localField: 'meta.site', 
    foreignField: '_id', 
    as: 'meta.tmp12345',
  }},
  {$_internalUnpackBucket: {metaField: 'meta'}},
  {$set: {site_details: "$meta.tmp12345"}},
])

Generated at Thu Feb 08 05:43:40 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.