-
Type:
Bug
-
Resolution: Fixed
-
Priority:
Critical - P2
-
Affects Version/s: None
-
Component/s: Aggregation Framework
-
None
-
Fully Compatible
-
v3.4
-
Query 2017-07-10
-
(copied to CRM)
-
None
-
None
-
None
-
None
-
None
-
None
-
None
Background — $sort + $match
When $match follows $sort it gets moved to be ahead of the $sort. This makes sense because it reduces the N in the O(NlogN) of the sort, ie. the aggregation:
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
{ $sort: { b: 1 } },
{ $match: { c: 1 } } ] )
can be more efficiently implementated (with identical results) by:
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
{ $match: { c: 1 } },
{ $sort: { b: 1 } } ] )
And indeed both of these aggregations have the same explain plan:
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
{ $sort: { b: 1 } },
{ $match: { c: 1 } } ] )
{
"stages" : [
{
"$cursor" : {
"query" : {
},
"fields" : {
"a" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.foo",
"indexFilterSet" : false,
"parsedQuery" : {
},
"winningPlan" : {
"stage" : "COLLSCAN",
"direction" : "forward"
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : "$a"
}
},
{
"$match" : {
"c" : 1
}
},
{
"$sort" : {
"sortKey" : {
"b" : 1
}
}
}
],
"ok" : 1
}
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
{ $match: { c: 1 } },
{ $sort: { b: 1 } } ] )
{
"stages" : [
{
"$cursor" : {
"query" : {
},
"fields" : {
"a" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.foo",
"indexFilterSet" : false,
"parsedQuery" : {
},
"winningPlan" : {
"stage" : "COLLSCAN",
"direction" : "forward"
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : "$a"
}
},
{
"$match" : {
"c" : 1
}
},
{
"$sort" : {
"sortKey" : {
"b" : 1
}
}
}
],
"ok" : 1
}
Background — $sort + $limit
When $sort is followed by $limit (a common use case), the $limit gets pushed into the $sort stage. This allows a more efficient top-k algorithm to be used, which is usually O(Nlogk) (which reduces to O(N) if k is small) — compared to O(NlogN) to naively sort the documents and then apply the limit.
eg. in the following aggregation, the maximum document output by $group can be found after a single pass:
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
{ $sort: { b: 1 } }, { $limit: 1 } ] )
{
"stages" : [
{
"$cursor" : {
"query" : {
},
"fields" : {
"a" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.foo",
"indexFilterSet" : false,
"parsedQuery" : {
},
"winningPlan" : {
"stage" : "COLLSCAN",
"direction" : "forward"
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : "$a"
}
},
{
"$sort" : {
"sortKey" : {
"b" : 1
},
"limit" : NumberLong(1)
}
}
],
"ok" : 1
}
Problem — $sort + $limit + $match
Now consider when a $match occurs after $sort + $limit. This table shows the number of documents processed after the $group stage, with a $limit of 1, for the best case ($match matches no documents) and worst case ($match matches all documents):
| aggregation | best case | worst case |
|---|---|---|
[ { $group: { _id: '$a' } },
{ $match: { c: 1 } },
{ $sort: { b: 1 } }, { $limit: 1 } ]
|
N | 2N + 1 |
[ { $group: { _id: '$a' } },
{ $sort: { b: 1 } }, { $limit: 1 },
{ $match: { c: 1 } } ]
|
N | N + 2 |
This shows that for $sort + $limit: 1 it is better (and never worse) for the $match to appear afterwards. For larger k, this may not be the case, and for k = N, the problem degenerates to sorting without a limit (and hence the $match is better off before the $sort).
Unfortunately, the behaviour of the server is to always move $match ahead of the $sort + $limit (even though sometimes this is worse):
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
... { $sort: { b: 1 } }, { $limit: 1 },
... { $match: { c: 1 } } ] )
{
"stages" : [
{
"$cursor" : {
"query" : {
},
"fields" : {
"a" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.foo",
"indexFilterSet" : false,
"parsedQuery" : {
},
"winningPlan" : {
"stage" : "COLLSCAN",
"direction" : "forward"
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : "$a"
}
},
{
"$match" : {
"c" : 1
}
},
{
"$sort" : {
"sortKey" : {
"b" : 1
},
"limit" : NumberLong(1)
}
}
],
"ok" : 1
}
Conclusion
Ideally there are two changes here:
Avoid moving $match ahead of a $sort with limit.Move $match to appear after a $sort with limit of 1.
The second could be optional, since the first would allow users to manually adjust their aggregation (based on their N and k).
The server should avoid moving $match ahead of a $sort with limit.
This change may be harder if the $match move happens before the $limit is coalesced into the $sort.