Loading...

XML

Word

Printable

JSON

Type: Task
Resolution: Won't Do
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
None

Assigned Teams:

Query Optimization
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

This problem only affects the Bonsai optimizer. The $lookup implementation in Bonsai is known to be incomplete, so I've chosen to classify this ticket as "Task" rather than "Bug". I'll demonstrate the problem by example. Suppose we have the following two collections:

MongoDB Enterprise > db.c.find()
{ "_id" : ObjectId("62e1988ae2bd597ae6a53976"), "a" : 1 }
{ "_id" : ObjectId("62e1988be2bd597ae6a53978"), "a" : 1 }
MongoDB Enterprise > db.d.find()
{ "_id" : ObjectId("62e19305f999d953c3255b2c"), "b" : 1 }

Consider the following query, executed using the classic engine:

MongoDB Enterprise > db.c.aggregate([{$project: {_id: 0}}, {$lookup: {from: "d", as: "outField", localField: "a", foreignField: "b"}}], {allowDiskUse: false})
{ "a" : 1, "outField" : [ { "_id" : ObjectId("62e19305f999d953c3255b2c"), "b" : 1 } ] }
{ "a" : 1, "outField" : [ { "_id" : ObjectId("62e19305f999d953c3255b2c"), "b" : 1 } ] }

After enabling the Bonsai optimizer, the same query returns the wrong results:

MongoDB Enterprise > db.c.aggregate([{$project: {_id: 0}}, {$lookup: {from: "d", as: "outField", localField: "a", foreignField: "b"}}], {allowDiskUse: false})
{ "a" : 1, "outField" : [ { "_id" : ObjectId("62e19305f999d953c3255b2c"), "b" : 1 }, { "_id" : ObjectId("62e19305f999d953c3255b2c"), "b" : 1 } ] }

The problem is due to a subtly incorrect SBE plan. The plan looks like this:

[8] project [s9 = if (exists (s8) || isObject (s7), setField (s7, "outField", s8), s7)]
[7] group [s3] [s7 = first (s2), s8 = addToArray (s5)]
[6] nlj [s2, s3, s4] [] {traverseF (s6, \ l101.0 . if (isArray (s4), isMember (l101.0, s4), l101.0 == s4), false)}
    left
        [4] project [s4 = fillEmpty (getField (s2, "a"), null)]
        [2] project [s3 = getField (s2, "_id")]
        [1] project [s2 = if (isObject (s1), dropFields (s1, "_id"), s1)]
        [0] scan s1 none none none none none [] @"39177734-b8f0-4357-98cf-3b38bbd9c9e9" true false
    right
        [5] scan s5 none none none none none [s6 = b] @"c90930d0-ce88-491b-8e39-76b9a8c7ee1a" true false

The part of interest here is the group stage:

group [s3] [s7 = first (s2), s8 = addToArray (s5)]

Roughly speaking, this is taking all of the documents from the foreign side which are join partners with a document from the local side and sticking them into an array. The problem is that we are grouping by the contents of the document from the local side. If there are two identical documents on the local side (the situation that I have artificially constructed here), then join partners from multiple documents get incorrectly aggregated together.

Perhaps we could instead build a plan which generates a sequence number or an internal id associated with each unique document from the local side. Then we could solve the problem by grouping on this id rather than grouping on the full contents of the document.

Assignee:: [DO NOT USE] Backlog - Query Optimization
Reporter:: David Storch
Participants:: [DO NOT USE] Backlog - Query Optimization, Ana Meza, David Storch
Votes:: 0 Vote for this issue
Watchers:: 5 Start watching this issue

Created:: Jul 27 2022 08:19:37 PM UTC
Updated:: Nov 29 2024 03:44:56 PM UTC
Resolved:: Nov 29 2024 03:44:55 PM UTC

Details

Description

Attachments

Activity

People

Dates