Loading...

Type: Bug
Resolution: Fixed
Priority: Critical - P2
Fix Version/s: 3.4.6, 3.5.10
Affects Version/s: None
Component/s: Aggregation Framework
Labels:
None

Backwards Compatibility:
Fully Compatible
Backport Requested:

v3.4
Sprint:
Query 2017-07-10
Case:
Confidence Status:
None
Work Order:
0
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Background — `$sort` + `$match`

When $match follows $sort it gets moved to be ahead of the $sort. This makes sense because it reduces the N in the O(NlogN) of the sort, ie. the aggregation:

> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
                                { $sort: { b: 1 } },
                                { $match: { c: 1 } } ] )

can be more efficiently implementated (with identical results) by:

> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
                                { $match: { c: 1 } },
                                { $sort: { b: 1 } } ] )

And indeed both of these aggregations have the same explain plan:

> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
                                { $sort: { b: 1 } },
                                { $match: { c: 1 } } ] )
{
        "stages" : [
                {
                        "$cursor" : {
                                "query" : {

                                },
                                "fields" : {
                                        "a" : 1,
                                        "_id" : 0
                                },
                                "queryPlanner" : {
                                        "plannerVersion" : 1,
                                        "namespace" : "test.foo",
                                        "indexFilterSet" : false,
                                        "parsedQuery" : {

                                        },
                                        "winningPlan" : {
                                                "stage" : "COLLSCAN",
                                                "direction" : "forward"
                                        },
                                        "rejectedPlans" : [ ]
                                }
                        }
                },
                {
                        "$group" : {
                                "_id" : "$a"
                        }
                },
                {
                        "$match" : {
                                "c" : 1
                        }
                },
                {
                        "$sort" : {
                                "sortKey" : {
                                        "b" : 1
                                }
                        }
                }
        ],
        "ok" : 1
}
> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
                                { $match: { c: 1 } },
                                { $sort: { b: 1 } } ] )
{
        "stages" : [
                {
                        "$cursor" : {
                                "query" : {

                                },
                                "fields" : {
                                        "a" : 1,
                                        "_id" : 0
                                },
                                "queryPlanner" : {
                                        "plannerVersion" : 1,
                                        "namespace" : "test.foo",
                                        "indexFilterSet" : false,
                                        "parsedQuery" : {

                                        },
                                        "winningPlan" : {
                                                "stage" : "COLLSCAN",
                                                "direction" : "forward"
                                        },
                                        "rejectedPlans" : [ ]
                                }
                        }
                },
                {
                        "$group" : {
                                "_id" : "$a"
                        }
                },
                {
                        "$match" : {
                                "c" : 1
                        }
                },
                {
                        "$sort" : {
                                "sortKey" : {
                                        "b" : 1
                                }
                        }
                }
        ],
        "ok" : 1
}

Background — `$sort` + `$limit`

When $sort is followed by $limit (a common use case), the $limit gets pushed into the $sort stage. This allows a more efficient top-k algorithm to be used, which is usually O(Nlogk) (which reduces to O(N) if k is small) — compared to O(NlogN) to naively sort the documents and then apply the limit.

eg. in the following aggregation, the maximum document output by $group can be found after a single pass:

> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
                                { $sort: { b: 1 } }, { $limit: 1 } ] )
{
        "stages" : [
                {
                        "$cursor" : {
                                "query" : {

                                },
                                "fields" : {
                                        "a" : 1,
                                        "_id" : 0
                                },
                                "queryPlanner" : {
                                        "plannerVersion" : 1,
                                        "namespace" : "test.foo",
                                        "indexFilterSet" : false,
                                        "parsedQuery" : {

                                        },
                                        "winningPlan" : {
                                                "stage" : "COLLSCAN",
                                                "direction" : "forward"
                                        },
                                        "rejectedPlans" : [ ]
                                }
                        }
                },
                {
                        "$group" : {
                                "_id" : "$a"
                        }
                },
                {
                        "$sort" : {
                                "sortKey" : {
                                        "b" : 1
                                },
                                "limit" : NumberLong(1)
                        }
                }
        ],
        "ok" : 1
}

Problem — `$sort` + `$limit` + `$match`

Now consider when a $match occurs after $sort + $limit. This table shows the number of documents processed after the $group stage, with a $limit of 1, for the best case ($match matches no documents) and worst case ($match matches all documents):

aggregation	best case	worst case
[ { $group: { _id: '$a' } }, { $match: { c: 1 } }, { $sort: { b: 1 } }, { $limit: 1 } ]	`N`	`2N + 1`
[ { $group: { _id: '$a' } }, { $sort: { b: 1 } }, { $limit: 1 }, { $match: { c: 1 } } ]	`N`	`N + 2`

This shows that for $sort + $limit: 1 it is better (and never worse) for the $match to appear afterwards. For larger k, this may not be the case, and for k = N, the problem degenerates to sorting without a limit (and hence the $match is better off before the $sort).

Unfortunately, the behaviour of the server is to always move $match ahead of the $sort + $limit (even though sometimes this is worse):

> db.foo.explain().aggregate( [ { $group: { _id: '$a' } },
... { $sort: { b: 1 } }, { $limit: 1 },
... { $match: { c: 1 } } ] )
{
        "stages" : [
                {
                        "$cursor" : {
                                "query" : {

                                },
                                "fields" : {
                                        "a" : 1,
                                        "_id" : 0
                                },
                                "queryPlanner" : {
                                        "plannerVersion" : 1,
                                        "namespace" : "test.foo",
                                        "indexFilterSet" : false,
                                        "parsedQuery" : {

                                        },
                                        "winningPlan" : {
                                                "stage" : "COLLSCAN",
                                                "direction" : "forward"
                                        },
                                        "rejectedPlans" : [ ]
                                }
                        }
                },
                {
                        "$group" : {
                                "_id" : "$a"
                        }
                },
                {
                        "$match" : {
                                "c" : 1
                        }
                },
                {
                        "$sort" : {
                                "sortKey" : {
                                        "b" : 1
                                },
                                "limit" : NumberLong(1)
                        }
                }
        ],
        "ok" : 1
}

Conclusion

~~Ideally there are two changes here:~~

~~Avoid moving $match ahead of a $sort with limit.~~
~~Move $match to appear after a $sort with limit of 1.~~

~~The second could be optional, since the first would allow users to manually adjust their aggregation (based on their N and k).~~

The server should avoid moving $match ahead of a $sort with limit.

This change may be harder if the $match move happens before the $limit is coalesced into the $sort.

Details

Description

Background — $sort + $match

Background — $sort + $limit

Problem — $sort + $limit + $match

Conclusion

Attachments

Forms

Activity

People

Dates

Background — `$sort` + `$match`

Background — `$sort` + `$limit`

Problem — `$sort` + `$limit` + `$match`