-
Type: Bug
-
Resolution: Fixed
-
Priority: Critical - P2
-
Affects Version/s: None
-
Component/s: Aggregation Framework
-
None
-
Fully Compatible
-
v3.4
-
Query 2017-07-10
-
(copied to CRM)
Background — $sort + $match
When $match follows $sort it gets moved to be ahead of the $sort. This makes sense because it reduces the N in the O(NlogN) of the sort, ie. the aggregation:
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } }, { $sort: { b: 1 } }, { $match: { c: 1 } } ] )
can be more efficiently implementated (with identical results) by:
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } }, { $match: { c: 1 } }, { $sort: { b: 1 } } ] )
And indeed both of these aggregations have the same explain plan:
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } }, { $sort: { b: 1 } }, { $match: { c: 1 } } ] ) { "stages" : [ { "$cursor" : { "query" : { }, "fields" : { "a" : 1, "_id" : 0 }, "queryPlanner" : { "plannerVersion" : 1, "namespace" : "test.foo", "indexFilterSet" : false, "parsedQuery" : { }, "winningPlan" : { "stage" : "COLLSCAN", "direction" : "forward" }, "rejectedPlans" : [ ] } } }, { "$group" : { "_id" : "$a" } }, { "$match" : { "c" : 1 } }, { "$sort" : { "sortKey" : { "b" : 1 } } } ], "ok" : 1 } > db.foo.explain().aggregate( [ { $group: { _id: '$a' } }, { $match: { c: 1 } }, { $sort: { b: 1 } } ] ) { "stages" : [ { "$cursor" : { "query" : { }, "fields" : { "a" : 1, "_id" : 0 }, "queryPlanner" : { "plannerVersion" : 1, "namespace" : "test.foo", "indexFilterSet" : false, "parsedQuery" : { }, "winningPlan" : { "stage" : "COLLSCAN", "direction" : "forward" }, "rejectedPlans" : [ ] } } }, { "$group" : { "_id" : "$a" } }, { "$match" : { "c" : 1 } }, { "$sort" : { "sortKey" : { "b" : 1 } } } ], "ok" : 1 }
Background — $sort + $limit
When $sort is followed by $limit (a common use case), the $limit gets pushed into the $sort stage. This allows a more efficient top-k algorithm to be used, which is usually O(Nlogk) (which reduces to O(N) if k is small) — compared to O(NlogN) to naively sort the documents and then apply the limit.
eg. in the following aggregation, the maximum document output by $group can be found after a single pass:
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } }, { $sort: { b: 1 } }, { $limit: 1 } ] ) { "stages" : [ { "$cursor" : { "query" : { }, "fields" : { "a" : 1, "_id" : 0 }, "queryPlanner" : { "plannerVersion" : 1, "namespace" : "test.foo", "indexFilterSet" : false, "parsedQuery" : { }, "winningPlan" : { "stage" : "COLLSCAN", "direction" : "forward" }, "rejectedPlans" : [ ] } } }, { "$group" : { "_id" : "$a" } }, { "$sort" : { "sortKey" : { "b" : 1 }, "limit" : NumberLong(1) } } ], "ok" : 1 }
Problem — $sort + $limit + $match
Now consider when a $match occurs after $sort + $limit. This table shows the number of documents processed after the $group stage, with a $limit of 1, for the best case ($match matches no documents) and worst case ($match matches all documents):
aggregation | best case | worst case |
---|---|---|
[ { $group: { _id: '$a' } },
{ $match: { c: 1 } },
{ $sort: { b: 1 } }, { $limit: 1 } ]
|
N | 2N + 1 |
[ { $group: { _id: '$a' } },
{ $sort: { b: 1 } }, { $limit: 1 },
{ $match: { c: 1 } } ]
|
N | N + 2 |
This shows that for $sort + $limit: 1 it is better (and never worse) for the $match to appear afterwards. For larger k, this may not be the case, and for k = N, the problem degenerates to sorting without a limit (and hence the $match is better off before the $sort).
Unfortunately, the behaviour of the server is to always move $match ahead of the $sort + $limit (even though sometimes this is worse):
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } }, ... { $sort: { b: 1 } }, { $limit: 1 }, ... { $match: { c: 1 } } ] ) { "stages" : [ { "$cursor" : { "query" : { }, "fields" : { "a" : 1, "_id" : 0 }, "queryPlanner" : { "plannerVersion" : 1, "namespace" : "test.foo", "indexFilterSet" : false, "parsedQuery" : { }, "winningPlan" : { "stage" : "COLLSCAN", "direction" : "forward" }, "rejectedPlans" : [ ] } } }, { "$group" : { "_id" : "$a" } }, { "$match" : { "c" : 1 } }, { "$sort" : { "sortKey" : { "b" : 1 }, "limit" : NumberLong(1) } } ], "ok" : 1 }
Conclusion
Ideally there are two changes here:
Avoid moving $match ahead of a $sort with limit.Move $match to appear after a $sort with limit of 1.
The second could be optional, since the first would allow users to manually adjust their aggregation (based on their N and k).
The server should avoid moving $match ahead of a $sort with limit.
This change may be harder if the $match move happens before the $limit is coalesced into the $sort.