[SERVER-29647] Avoid moving $match to be before $sort + $limit Created: 15/Jun/17  Updated: 30/Oct/23  Resolved: 23/Jun/17

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: 3.4.6, 3.5.10

Type: Bug Priority: Critical - P2
Reporter: Kevin Pulo Assignee: David Storch
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Backwards Compatibility: Fully Compatible
Backport Requested:
v3.4
Sprint: Query 2017-07-10
Participants:
Case:

 Description   

Background — $sort + $match

When $match follows $sort it gets moved to be ahead of the $sort. This makes sense because it reduces the N in the O(NlogN) of the sort, ie. the aggregation:

> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
                                { $sort: { b: 1 } },
                                { $match: { c: 1 } } ] )

can be more efficiently implementated (with identical results) by:

> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
                                { $match: { c: 1 } },
                                { $sort: { b: 1 } } ] )

And indeed both of these aggregations have the same explain plan:

> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
                                { $sort: { b: 1 } },
                                { $match: { c: 1 } } ] )
{
        "stages" : [
                {
                        "$cursor" : {
                                "query" : {
 
                                },
                                "fields" : {
                                        "a" : 1,
                                        "_id" : 0
                                },
                                "queryPlanner" : {
                                        "plannerVersion" : 1,
                                        "namespace" : "test.foo",
                                        "indexFilterSet" : false,
                                        "parsedQuery" : {
 
                                        },
                                        "winningPlan" : {
                                                "stage" : "COLLSCAN",
                                                "direction" : "forward"
                                        },
                                        "rejectedPlans" : [ ]
                                }
                        }
                },
                {
                        "$group" : {
                                "_id" : "$a"
                        }
                },
                {
                        "$match" : {
                                "c" : 1
                        }
                },
                {
                        "$sort" : {
                                "sortKey" : {
                                        "b" : 1
                                }
                        }
                }
        ],
        "ok" : 1
}
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
                                { $match: { c: 1 } },
                                { $sort: { b: 1 } } ] )
{
        "stages" : [
                {
                        "$cursor" : {
                                "query" : {
 
                                },
                                "fields" : {
                                        "a" : 1,
                                        "_id" : 0
                                },
                                "queryPlanner" : {
                                        "plannerVersion" : 1,
                                        "namespace" : "test.foo",
                                        "indexFilterSet" : false,
                                        "parsedQuery" : {
 
                                        },
                                        "winningPlan" : {
                                                "stage" : "COLLSCAN",
                                                "direction" : "forward"
                                        },
                                        "rejectedPlans" : [ ]
                                }
                        }
                },
                {
                        "$group" : {
                                "_id" : "$a"
                        }
                },
                {
                        "$match" : {
                                "c" : 1
                        }
                },
                {
                        "$sort" : {
                                "sortKey" : {
                                        "b" : 1
                                }
                        }
                }
        ],
        "ok" : 1
}

Background — $sort + $limit

When $sort is followed by $limit (a common use case), the $limit gets pushed into the $sort stage. This allows a more efficient top-k algorithm to be used, which is usually O(Nlogk) (which reduces to O(N) if k is small) — compared to O(NlogN) to naively sort the documents and then apply the limit.

eg. in the following aggregation, the maximum document output by $group can be found after a single pass:

> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
                                { $sort: { b: 1 } }, { $limit: 1 } ] )
{
        "stages" : [
                {
                        "$cursor" : {
                                "query" : {
 
                                },
                                "fields" : {
                                        "a" : 1,
                                        "_id" : 0
                                },
                                "queryPlanner" : {
                                        "plannerVersion" : 1,
                                        "namespace" : "test.foo",
                                        "indexFilterSet" : false,
                                        "parsedQuery" : {
 
                                        },
                                        "winningPlan" : {
                                                "stage" : "COLLSCAN",
                                                "direction" : "forward"
                                        },
                                        "rejectedPlans" : [ ]
                                }
                        }
                },
                {
                        "$group" : {
                                "_id" : "$a"
                        }
                },
                {
                        "$sort" : {
                                "sortKey" : {
                                        "b" : 1
                                },
                                "limit" : NumberLong(1)
                        }
                }
        ],
        "ok" : 1
}

Problem — $sort + $limit + $match

Now consider when a $match occurs after $sort + $limit. This table shows the number of documents processed after the $group stage, with a $limit of 1, for the best case ($match matches no documents) and worst case ($match matches all documents):

aggregation best case worst case

[ { $group: { _id: '$a' } },
  { $match: { c: 1 } },
  { $sort: { b: 1 } }, { $limit: 1 } ]

N 2N + 1

[ { $group: { _id: '$a' } },
  { $sort: { b: 1 } }, { $limit: 1 },
  { $match: { c: 1 } } ]

N N + 2

This shows that for $sort + $limit: 1 it is better (and never worse) for the $match to appear afterwards. For larger k, this may not be the case, and for k = N, the problem degenerates to sorting without a limit (and hence the $match is better off before the $sort).

Unfortunately, the behaviour of the server is to always move $match ahead of the $sort + $limit (even though sometimes this is worse):

> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
... { $sort: { b: 1 } }, { $limit: 1 },
... { $match: { c: 1 } } ] )
{
        "stages" : [
                {
                        "$cursor" : {
                                "query" : {
 
                                },
                                "fields" : {
                                        "a" : 1,
                                        "_id" : 0
                                },
                                "queryPlanner" : {
                                        "plannerVersion" : 1,
                                        "namespace" : "test.foo",
                                        "indexFilterSet" : false,
                                        "parsedQuery" : {
 
                                        },
                                        "winningPlan" : {
                                                "stage" : "COLLSCAN",
                                                "direction" : "forward"
                                        },
                                        "rejectedPlans" : [ ]
                                }
                        }
                },
                {
                        "$group" : {
                                "_id" : "$a"
                        }
                },
                {
                        "$match" : {
                                "c" : 1
                        }
                },
                {
                        "$sort" : {
                                "sortKey" : {
                                        "b" : 1
                                },
                                "limit" : NumberLong(1)
                        }
                }
        ],
        "ok" : 1
}

Conclusion

Ideally there are two changes here:

  • Avoid moving $match ahead of a $sort with limit.
  • Move $match to appear after a $sort with limit of 1.

The second could be optional, since the first would allow users to manually adjust their aggregation (based on their N and k).

The server should avoid moving $match ahead of a $sort with limit.

This change may be harder if the $match move happens before the $limit is coalesced into the $sort.



 Comments   
Comment by Githook User [ 26/Jun/17 ]

Author:

{u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}

Message: SERVER-29647 Fix $match swapping to avoid moving before $sort-$limit.

(cherry picked from commit 8b8845703f0c92bafca58ce5d9f36fd2327301d1)

Conflicts:
jstests/core/views/views_aggregation.js
Branch: v3.4
https://github.com/mongodb/mongo/commit/8bc18f2367d616ef1e68b3be4acfaa1d2e8daa6e

Comment by Githook User [ 23/Jun/17 ]

Author:

{u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}

Message: SERVER-29647 Fix $match swapping to avoid moving before $sort-$limit.
Branch: master
https://github.com/mongodb/mongo/commit/8b8845703f0c92bafca58ce5d9f36fd2327301d1

Comment by Asya Kamsky [ 19/Jun/17 ]

Right, since $match can't be moved ahead of $limit ever, it also can't be moved in front of sort-with-limit.

Comment by Kevin Pulo [ 19/Jun/17 ]

Nice counter-example! I've updated the "Conclusion" section of the description accordingly.

Comment by Charlie Swanson [ 15/Jun/17 ]

Missed the second point in your conclusion:

Move $match to appear after a $sort with limit of 1.

I think this is also incorrect, as it could possibly change the sort order.

Comment by Charlie Swanson [ 15/Jun/17 ]

I think this is actually even simpler. I think it is a straight-up bug to move a $match before a $sort if the $sort has absorbed a $limit:

(mongod-3.4.4) test> db.bar.aggregate([{$sort: {x: -1}}, {$limit: 1}, {$match: {x: {$mod: [2, 1]}}}])
      {
        "_id": ObjectId("5942e317b8328df32ca4fdb2"),
        "x": 3
      }
(mongod-3.4.4) test> db.bar.aggregate([
  {$sort: {x: -1}},
  {$limit: 1},
  {$project: {y: "$x"}} /* renames x to trick $match to stay after $sort */,
  {$match: {y: {$mod: [2, 1]}}}
])
// No results

Generated at Thu Feb 08 04:21:26 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.