Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-68369

Bonsai generates an incorrect plan for a $lookup query using an NLJ strategy

    XMLWordPrintableJSON

Details

    • Icon: Task Task
    • Resolution: Unresolved
    • Icon: Major - P3 Major - P3
    • None
    • None
    • None
    • None
    • Query Optimization

    Description

      This problem only affects the Bonsai optimizer. The $lookup implementation in Bonsai is known to be incomplete, so I've chosen to classify this ticket as "Task" rather than "Bug". I'll demonstrate the problem by example. Suppose we have the following two collections:

      MongoDB Enterprise > db.c.find()
      { "_id" : ObjectId("62e1988ae2bd597ae6a53976"), "a" : 1 }
      { "_id" : ObjectId("62e1988be2bd597ae6a53978"), "a" : 1 }
      MongoDB Enterprise > db.d.find()
      { "_id" : ObjectId("62e19305f999d953c3255b2c"), "b" : 1 }
      

      Consider the following query, executed using the classic engine:

      MongoDB Enterprise > db.c.aggregate([{$project: {_id: 0}}, {$lookup: {from: "d", as: "outField", localField: "a", foreignField: "b"}}], {allowDiskUse: false})
      { "a" : 1, "outField" : [ { "_id" : ObjectId("62e19305f999d953c3255b2c"), "b" : 1 } ] }
      { "a" : 1, "outField" : [ { "_id" : ObjectId("62e19305f999d953c3255b2c"), "b" : 1 } ] }
      

      After enabling the Bonsai optimizer, the same query returns the wrong results:

      MongoDB Enterprise > db.c.aggregate([{$project: {_id: 0}}, {$lookup: {from: "d", as: "outField", localField: "a", foreignField: "b"}}], {allowDiskUse: false})
      { "a" : 1, "outField" : [ { "_id" : ObjectId("62e19305f999d953c3255b2c"), "b" : 1 }, { "_id" : ObjectId("62e19305f999d953c3255b2c"), "b" : 1 } ] }
      

      The problem is due to a subtly incorrect SBE plan. The plan looks like this:

      [8] project [s9 = if (exists (s8) || isObject (s7), setField (s7, "outField", s8), s7)]
      [7] group [s3] [s7 = first (s2), s8 = addToArray (s5)]
      [6] nlj [s2, s3, s4] [] {traverseF (s6, \ l101.0 . if (isArray (s4), isMember (l101.0, s4), l101.0 == s4), false)}
          left
              [4] project [s4 = fillEmpty (getField (s2, "a"), null)]
              [2] project [s3 = getField (s2, "_id")]
              [1] project [s2 = if (isObject (s1), dropFields (s1, "_id"), s1)]
              [0] scan s1 none none none none none [] @"39177734-b8f0-4357-98cf-3b38bbd9c9e9" true false
          right
              [5] scan s5 none none none none none [s6 = b] @"c90930d0-ce88-491b-8e39-76b9a8c7ee1a" true false
      

      The part of interest here is the group stage:

      group [s3] [s7 = first (s2), s8 = addToArray (s5)]
      

      Roughly speaking, this is taking all of the documents from the foreign side which are join partners with a document from the local side and sticking them into an array. The problem is that we are grouping by the contents of the document from the local side. If there are two identical documents on the local side (the situation that I have artificially constructed here), then join partners from multiple documents get incorrectly aggregated together.

      Perhaps we could instead build a plan which generates a sequence number or an internal id associated with each unique document from the local side. Then we could solve the problem by grouping on this id rather than grouping on the full contents of the document.

      Attachments

        Activity

          People

            backlog-query-optimization Backlog - Query Optimization
            david.storch@mongodb.com David Storch
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated: