Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-29647

Avoid moving $match to be before $sort + $limit

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.4.6, 3.5.10
    • Component/s: Aggregation Framework
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Backport Requested:
      v3.4
    • Sprint:
      Query 2017-07-10
    • Case:

      Description

      Background — $sort + $match

      When $match follows $sort it gets moved to be ahead of the $sort. This makes sense because it reduces the N in the O(NlogN) of the sort, ie. the aggregation:

      > db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
                                      { $sort: { b: 1 } },
                                      { $match: { c: 1 } } ] )
      

      can be more efficiently implementated (with identical results) by:

      > db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
                                      { $match: { c: 1 } },
                                      { $sort: { b: 1 } } ] )
      

      And indeed both of these aggregations have the same explain plan:

      > db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
                                      { $sort: { b: 1 } },
                                      { $match: { c: 1 } } ] )
      {
              "stages" : [
                      {
                              "$cursor" : {
                                      "query" : {
       
                                      },
                                      "fields" : {
                                              "a" : 1,
                                              "_id" : 0
                                      },
                                      "queryPlanner" : {
                                              "plannerVersion" : 1,
                                              "namespace" : "test.foo",
                                              "indexFilterSet" : false,
                                              "parsedQuery" : {
       
                                              },
                                              "winningPlan" : {
                                                      "stage" : "COLLSCAN",
                                                      "direction" : "forward"
                                              },
                                              "rejectedPlans" : [ ]
                                      }
                              }
                      },
                      {
                              "$group" : {
                                      "_id" : "$a"
                              }
                      },
                      {
                              "$match" : {
                                      "c" : 1
                              }
                      },
                      {
                              "$sort" : {
                                      "sortKey" : {
                                              "b" : 1
                                      }
                              }
                      }
              ],
              "ok" : 1
      }
      > db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
                                      { $match: { c: 1 } },
                                      { $sort: { b: 1 } } ] )
      {
              "stages" : [
                      {
                              "$cursor" : {
                                      "query" : {
       
                                      },
                                      "fields" : {
                                              "a" : 1,
                                              "_id" : 0
                                      },
                                      "queryPlanner" : {
                                              "plannerVersion" : 1,
                                              "namespace" : "test.foo",
                                              "indexFilterSet" : false,
                                              "parsedQuery" : {
       
                                              },
                                              "winningPlan" : {
                                                      "stage" : "COLLSCAN",
                                                      "direction" : "forward"
                                              },
                                              "rejectedPlans" : [ ]
                                      }
                              }
                      },
                      {
                              "$group" : {
                                      "_id" : "$a"
                              }
                      },
                      {
                              "$match" : {
                                      "c" : 1
                              }
                      },
                      {
                              "$sort" : {
                                      "sortKey" : {
                                              "b" : 1
                                      }
                              }
                      }
              ],
              "ok" : 1
      }
      

      Background — $sort + $limit

      When $sort is followed by $limit (a common use case), the $limit gets pushed into the $sort stage. This allows a more efficient top-k algorithm to be used, which is usually O(Nlogk) (which reduces to O(N) if k is small) — compared to O(NlogN) to naively sort the documents and then apply the limit.

      eg. in the following aggregation, the maximum document output by $group can be found after a single pass:

      > db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
                                      { $sort: { b: 1 } }, { $limit: 1 } ] )
      {
              "stages" : [
                      {
                              "$cursor" : {
                                      "query" : {
       
                                      },
                                      "fields" : {
                                              "a" : 1,
                                              "_id" : 0
                                      },
                                      "queryPlanner" : {
                                              "plannerVersion" : 1,
                                              "namespace" : "test.foo",
                                              "indexFilterSet" : false,
                                              "parsedQuery" : {
       
                                              },
                                              "winningPlan" : {
                                                      "stage" : "COLLSCAN",
                                                      "direction" : "forward"
                                              },
                                              "rejectedPlans" : [ ]
                                      }
                              }
                      },
                      {
                              "$group" : {
                                      "_id" : "$a"
                              }
                      },
                      {
                              "$sort" : {
                                      "sortKey" : {
                                              "b" : 1
                                      },
                                      "limit" : NumberLong(1)
                              }
                      }
              ],
              "ok" : 1
      }
      

      Problem — $sort + $limit + $match

      Now consider when a $match occurs after $sort + $limit. This table shows the number of documents processed after the $group stage, with a $limit of 1, for the best case ($match matches no documents) and worst case ($match matches all documents):

      aggregation best case worst case

      [ { $group: { _id: '$a' } },
        { $match: { c: 1 } },
        { $sort: { b: 1 } }, { $limit: 1 } ]

      N 2N + 1

      [ { $group: { _id: '$a' } },
        { $sort: { b: 1 } }, { $limit: 1 },
        { $match: { c: 1 } } ]

      N N + 2

      This shows that for $sort + $limit: 1 it is better (and never worse) for the $match to appear afterwards. For larger k, this may not be the case, and for k = N, the problem degenerates to sorting without a limit (and hence the $match is better off before the $sort).

      Unfortunately, the behaviour of the server is to always move $match ahead of the $sort + $limit (even though sometimes this is worse):

      > db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
      ... { $sort: { b: 1 } }, { $limit: 1 },
      ... { $match: { c: 1 } } ] )
      {
              "stages" : [
                      {
                              "$cursor" : {
                                      "query" : {
       
                                      },
                                      "fields" : {
                                              "a" : 1,
                                              "_id" : 0
                                      },
                                      "queryPlanner" : {
                                              "plannerVersion" : 1,
                                              "namespace" : "test.foo",
                                              "indexFilterSet" : false,
                                              "parsedQuery" : {
       
                                              },
                                              "winningPlan" : {
                                                      "stage" : "COLLSCAN",
                                                      "direction" : "forward"
                                              },
                                              "rejectedPlans" : [ ]
                                      }
                              }
                      },
                      {
                              "$group" : {
                                      "_id" : "$a"
                              }
                      },
                      {
                              "$match" : {
                                      "c" : 1
                              }
                      },
                      {
                              "$sort" : {
                                      "sortKey" : {
                                              "b" : 1
                                      },
                                      "limit" : NumberLong(1)
                              }
                      }
              ],
              "ok" : 1
      }
      

      Conclusion

      Ideally there are two changes here:

      • Avoid moving $match ahead of a $sort with limit.
      • Move $match to appear after a $sort with limit of 1.

      The second could be optional, since the first would allow users to manually adjust their aggregation (based on their N and k).

      The server should avoid moving $match ahead of a $sort with limit.

      This change may be harder if the $match move happens before the $limit is coalesced into the $sort.

        Attachments

          Activity

            People

            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: