Details
-
Bug
-
Resolution: Fixed
-
Critical - P2
-
None
-
None
-
Fully Compatible
-
v3.4
-
Query 2017-07-10
-
(copied to CRM)
Description
Background — $sort + $match
When $match follows $sort it gets moved to be ahead of the $sort. This makes sense because it reduces the N in the O(NlogN) of the sort, ie. the aggregation:
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
|
{ $sort: { b: 1 } },
|
{ $match: { c: 1 } } ] )
|
can be more efficiently implementated (with identical results) by:
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
|
{ $match: { c: 1 } },
|
{ $sort: { b: 1 } } ] )
|
And indeed both of these aggregations have the same explain plan:
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
|
{ $sort: { b: 1 } },
|
{ $match: { c: 1 } } ] )
|
{
|
"stages" : [
|
{
|
"$cursor" : {
|
"query" : {
|
|
},
|
"fields" : {
|
"a" : 1,
|
"_id" : 0
|
},
|
"queryPlanner" : {
|
"plannerVersion" : 1,
|
"namespace" : "test.foo",
|
"indexFilterSet" : false,
|
"parsedQuery" : {
|
|
},
|
"winningPlan" : {
|
"stage" : "COLLSCAN",
|
"direction" : "forward"
|
},
|
"rejectedPlans" : [ ]
|
}
|
}
|
},
|
{
|
"$group" : {
|
"_id" : "$a"
|
}
|
},
|
{
|
"$match" : {
|
"c" : 1
|
}
|
},
|
{
|
"$sort" : {
|
"sortKey" : {
|
"b" : 1
|
}
|
}
|
}
|
],
|
"ok" : 1
|
}
|
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
|
{ $match: { c: 1 } },
|
{ $sort: { b: 1 } } ] )
|
{
|
"stages" : [
|
{
|
"$cursor" : {
|
"query" : {
|
|
},
|
"fields" : {
|
"a" : 1,
|
"_id" : 0
|
},
|
"queryPlanner" : {
|
"plannerVersion" : 1,
|
"namespace" : "test.foo",
|
"indexFilterSet" : false,
|
"parsedQuery" : {
|
|
},
|
"winningPlan" : {
|
"stage" : "COLLSCAN",
|
"direction" : "forward"
|
},
|
"rejectedPlans" : [ ]
|
}
|
}
|
},
|
{
|
"$group" : {
|
"_id" : "$a"
|
}
|
},
|
{
|
"$match" : {
|
"c" : 1
|
}
|
},
|
{
|
"$sort" : {
|
"sortKey" : {
|
"b" : 1
|
}
|
}
|
}
|
],
|
"ok" : 1
|
}
|
Background — $sort + $limit
When $sort is followed by $limit (a common use case), the $limit gets pushed into the $sort stage. This allows a more efficient top-k algorithm to be used, which is usually O(Nlogk) (which reduces to O(N) if k is small) — compared to O(NlogN) to naively sort the documents and then apply the limit.
eg. in the following aggregation, the maximum document output by $group can be found after a single pass:
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
|
{ $sort: { b: 1 } }, { $limit: 1 } ] )
|
{
|
"stages" : [
|
{
|
"$cursor" : {
|
"query" : {
|
|
},
|
"fields" : {
|
"a" : 1,
|
"_id" : 0
|
},
|
"queryPlanner" : {
|
"plannerVersion" : 1,
|
"namespace" : "test.foo",
|
"indexFilterSet" : false,
|
"parsedQuery" : {
|
|
},
|
"winningPlan" : {
|
"stage" : "COLLSCAN",
|
"direction" : "forward"
|
},
|
"rejectedPlans" : [ ]
|
}
|
}
|
},
|
{
|
"$group" : {
|
"_id" : "$a"
|
}
|
},
|
{
|
"$sort" : {
|
"sortKey" : {
|
"b" : 1
|
},
|
"limit" : NumberLong(1)
|
}
|
}
|
],
|
"ok" : 1
|
}
|
Problem — $sort + $limit + $match
Now consider when a $match occurs after $sort + $limit. This table shows the number of documents processed after the $group stage, with a $limit of 1, for the best case ($match matches no documents) and worst case ($match matches all documents):
aggregation | best case | worst case | |||
---|---|---|---|---|---|
|
N | 2N + 1 | |||
|
N | N + 2 |
This shows that for $sort + $limit: 1 it is better (and never worse) for the $match to appear afterwards. For larger k, this may not be the case, and for k = N, the problem degenerates to sorting without a limit (and hence the $match is better off before the $sort).
Unfortunately, the behaviour of the server is to always move $match ahead of the $sort + $limit (even though sometimes this is worse):
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
|
... { $sort: { b: 1 } }, { $limit: 1 },
|
... { $match: { c: 1 } } ] )
|
{
|
"stages" : [
|
{
|
"$cursor" : {
|
"query" : {
|
|
},
|
"fields" : {
|
"a" : 1,
|
"_id" : 0
|
},
|
"queryPlanner" : {
|
"plannerVersion" : 1,
|
"namespace" : "test.foo",
|
"indexFilterSet" : false,
|
"parsedQuery" : {
|
|
},
|
"winningPlan" : {
|
"stage" : "COLLSCAN",
|
"direction" : "forward"
|
},
|
"rejectedPlans" : [ ]
|
}
|
}
|
},
|
{
|
"$group" : {
|
"_id" : "$a"
|
}
|
},
|
{
|
"$match" : {
|
"c" : 1
|
}
|
},
|
{
|
"$sort" : {
|
"sortKey" : {
|
"b" : 1
|
},
|
"limit" : NumberLong(1)
|
}
|
}
|
],
|
"ok" : 1
|
}
|
Conclusion
Ideally there are two changes here:
Avoid moving $match ahead of a $sort with limit.Move $match to appear after a $sort with limit of 1.
The second could be optional, since the first would allow users to manually adjust their aggregation (based on their N and k).
The server should avoid moving $match ahead of a $sort with limit.
This change may be harder if the $match move happens before the $limit is coalesced into the $sort.