Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-79548

[SBE] optimize pulling multiple fields out of a large document

    • Type: Icon: Improvement Improvement
    • Resolution: Unresolved
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Labels:
      None
    • Query Execution
    • QE 2023-08-21, QE 2023-09-04, QE 2023-09-18, QE 2023-10-02, QE 2023-10-16, QE 2023-10-30

      in this group query we are grouping several fields with a common subpath

      db.a.explain().aggregate({$match: {"magic.position": "air"}}, {$group: {_id: {a: "$foo.b.c", d: "$foo.e.f", g: "$foo.h.i"}}}).queryPlanner.winningPlan.slotBasedPlan.stages
      [2] mkbson s13 [_id = s12] true false
      [2] project [s12 = newObj("a", s9, "d", s10, "g", s11)]
      [2] group [s9, s10, s11] []
      [2] project [s9 = traverseP(s4, lambda(l101.0) { traverseP(getField(move(l101.0), "b"), lambda(l102.0) { getField(move(l102.0), "c") }, 1) }, 1), s10 = traverseP(s4, lambda(l101.0) { traverseP(getField(move(l101.0), "e"), lambda(l102.0) { getField(move(l102.0), "f") }, 1) }, 1), s11 = traverseP(s4, lambda(l101.0) { traverseP(getField(move(l101.0), "h"), lambda(l102.0) { getField(move(l102.0), "i") }, 1) }, 1)]
      [1] filter {traverseF(s5, lambda(l1.0) { traverseF(getField(l1.0, "position"), lambda(l2.0) { ((l2.0 == s8) ?: false) }, false) }, false)}
      [1] scan s6 s7 none none none none lowPriority [s4 = foo, s5 = magic] @"5ddad002-6f31-4499-a881-8cb09547796e" true false 

      each field on the sub-object "foo" has their own traverse. Even worse if there was a common subpath after foo (eg "foo.bar.a", "foo.bar.b", "foo.bar.c") we traverse for the bar field multiple times, when it could just be done once. These n^2 lookups become more painful when documents are large and there are many paths to lookup.

      One solution is to try to generate plans that look more like this projection query.

      db.a.explain().aggregate({$project: {"foo.b.c": 1, "foo.e.f": 1, "foo.h.i": 1, _id: 0}}).queryPlanner.winningPlan.slotBasedPlan.stages
      [2] project [s6 = traverseP(s4, lambda(l1.0) {
          if isObject(l1.0)
          then makeBsonObj(MakeObjSpec(keep, [], ["foo"]), l1.0, traverseP(getField(l1.0, "foo"), lambda(l2.0) {
              if isObject(l2.0)
              then makeBsonObj(MakeObjSpec(keep, [], ["b", "e", "h"]), l2.0, traverseP(getField(l2.0, "b"), lambda(l3.0) {
                  if isObject(l3.0)
                  then makeBsonObj(MakeObjSpec(keep, ["c"], []), l3.0)
                  else Nothing
              }, Nothing), traverseP(getField(l2.0, "e"), lambda(l4.0) {
                  if isObject(l4.0)
                  then makeBsonObj(MakeObjSpec(keep, ["f"], []), l4.0)
                  else Nothing
              }, Nothing), traverseP(getField(l2.0, "h"), lambda(l5.0) {
                  if isObject(l5.0)
                  then makeBsonObj(MakeObjSpec(keep, ["i"], []), l5.0)
                  else Nothing
              }, Nothing))
              else Nothing
          }, Nothing))
          else Nothing
      }, Nothing)]
      [1] scan s4 s5 none none none none lowPriority [] @"5ddad002-6f31-4499-a881-8cb09547796e" true false 

      this still however has a similar problem where each call to getField is done independently and can result in the full bson or subobject in the bson being scanned multiple times.

      there is logic in the classic engine here to avoid scanning a document multiple times if possible while doing an inclusion projection https://github.com/mongodb/mongo/blob/2e0259b3050e4c27d47e353222395d21bb80b9e4/src/mongo/db/exec/fastpath_projection_node.h#L88

            Assignee:
            backlog-query-execution [DO NOT USE] Backlog - Query Execution
            Reporter:
            mickey.winters@mongodb.com Mickey Winters
            Votes:
            2 Vote for this issue
            Watchers:
            15 Start watching this issue

              Created:
              Updated: