[SERVER-19402] Change semantics of sorting by array fields in find and aggregate Created: 14/Jul/15  Updated: 19/May/22  Resolved: 21/Jul/17

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: 3.5.11

Type: Improvement Priority: Major - P3
Reporter: Backlog - Query Team (Inactive) Assignee: David Storch
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
is documented by DOCS-10607 Docs for SERVER-19402: Change semanti... Closed
is documented by DOCS-11338 Add multikey index sort limitation to... Closed
Duplicate
is duplicated by SERVER-8152 aggregation array sort differs from q... Closed
is duplicated by SERVER-46092 Broken index after mongodb version up... Closed
Related
related to SERVER-31858 explodeForSort() planning path can in... Closed
related to SERVER-5065 a missing field nested within an arra... Closed
related to SERVER-8152 aggregation array sort differs from q... Closed
related to SERVER-11878 Results in incorrect order for sharde... Closed
related to SERVER-19394 Results in incorrect order for query ... Closed
related to SERVER-19397 Results in incorrect order for comple... Closed
related to SERVER-22872 Order by is not working in 3.2.3 Closed
related to SERVER-26655 $gt operation on array with index Closed
related to SERVER-32525 Sort order returned by a multikey ind... Closed
is related to SERVER-33387 Multikey index does not provide nonbl... Closed
is related to SERVER-35477 Sort by array field doesn't work Closed
is related to SERVER-480 figure out what to do with sorting by... Closed
is related to SERVER-20551 sort on multikey index after $elemMat... Closed
Backwards Compatibility: Minor Change
Sprint: Query 2017-06-19, Query 2017-07-10, Query 2017-07-31
Participants:
Case:

 Description   
Issue Status as of July 21, 2017

ISSUE SUMMARY
In versions of MongoDB prior to 3.5.11, sorting by a field containing an array suffered from multiple issues:

  • The sort orders for aggregate and find differed in the presence of arrays.
  • The sort order for find depended on the query predicate.
  • The sort order for find depended on the implementation details of the query planner.
  • The sort order for aggregate depended on the set of indexes available.

For more information, see the Pre-existing Behavior sections below.

In versions 3.5.11 and newer, the array sort semantics for find and aggregate have been changed to correct these semantic deficiencies, and to be consistent with one another. The new behavior is to sort documents according to the lowest-valued element of the array for ascending sorts, and according to the highest-valued element of the array for descending sorts. For a more detailed description of this behavior, see the Technical Details section below.

USER IMPACT
This is a minor breaking change. Applications that currently sort by an array field will experience a different sort order.

  • As a consequence of this change, it is no longer correct for a multikey index to be used to provide a sort on an array field.
  • In such cases, the query plan will instead include a blocking SORT stage.
  • For indexes that have path-level multikey metadata, as described in SERVER-15086, a multikey index will still be able to provide a sort as long as the sorted-on fields never contain an array.
  • Indexes built on the WiredTiger or In-memory storage engines on versions greater than or equal to 3.4.0 will always have path-level multikey metadata. Indexes built prior to 3.4 can be rebuilt on 3.4 or newer WiredTiger/In-memory in order to obtain multikey metadata.

TECHNICAL DETAILS
Arrays are now sorted according to their lowest-valued element. For example, consider sorting the following documents by {timestamps: 1}:

{ "_id" : 0, "timestamps" : [ ISODate("2017-07-21T15:31:01Z"), ISODate("2017-07-21T13:31:01Z") ] }
{ "_id" : 1, "timestamps" : [ ISODate("2017-07-15T15:31:01Z"), ISODate("2017-07-21T18:31:01Z") ] }

Document 0 has two timestamps: July 21 2017, 3:31 PM UTC and July 21 2017, 1:31 PM UTC. Document 1 also has two timestamps: July 15 2017, 3:31 PM UTC and July 21 2017 5:31 PM UTC. Document 0 will sort according to its earliest time, 1:31 PM on July 21. Similarly, document 1 will sort according to its July 15 datetime. Therefore document 1 will sort before document 0 in both find and aggregate:

> db.c.find().sort({timestamps: 1});
{ "_id" : 1, "timestamps" : [ ISODate("2017-07-15T15:31:01Z"), ISODate("2017-07-21T18:31:01Z") ] }
{ "_id" : 0, "timestamps" : [ ISODate("2017-07-21T15:31:01Z"), ISODate("2017-07-21T13:31:01Z") ] }
 
> db.c.aggregate([{$sort: {timestamps: 1}}]);
{ "_id" : 1, "timestamps" : [ ISODate("2017-07-15T15:31:01Z"), ISODate("2017-07-21T18:31:01Z") ] }
{ "_id" : 0, "timestamps" : [ ISODate("2017-07-21T15:31:01Z"), ISODate("2017-07-21T13:31:01Z") ] }

For a descending sort, the most recent time in the array will be used as the sort key: 3:31 PM on July 21 for document 0 and 5:31 PM on July 21 for document 1. Since the sort is descending these keys are then ordered from most recent to least recent, resulting in document 1 sorting before document 0:

> db.c.find().sort({timestamps: -1});
{ "_id" : 1, "timestamps" : [ ISODate("2017-07-15T15:31:01Z"), ISODate("2017-07-21T18:31:01Z") ] }
{ "_id" : 0, "timestamps" : [ ISODate("2017-07-21T15:31:01Z"), ISODate("2017-07-21T13:31:01Z") ] }
 
> db.c.aggregate([{$sort: {timestamps: -1}}])
{ "_id" : 1, "timestamps" : [ ISODate("2017-07-15T15:31:01Z"), ISODate("2017-07-21T18:31:01Z") ] }
{ "_id" : 0, "timestamps" : [ ISODate("2017-07-21T15:31:01Z"), ISODate("2017-07-21T13:31:01Z") ] }

The semantics defining "lowest-valued array element" are identical to those used for index key generation. To determine the sort key for a document d given sort pattern p, we generate the set of index keys K for d by treating p as an index key pattern. The sort key is min(K), where min is determined by the index ordering. Consider the example of sorting the following documents by {"customers.id": -1, "customers.code": 1}:

{_id: 0, customers:  [{id: 1, code: "X"}, {id: 2, code: "Y"}]}
{_id: 1, customers:  [{id: 3, code: "W"}, {id: 3, code: "Z"}]}

The index keys for document 0 are (1, "X") and (2, "Y"). The smallest key in this set according to the sort pattern is (2, "Y"), since we are sorting by customers.id descending. Therefore, (2, "Y") is the sort key for document 0. By similar logic, (3, "W") is the sort key for document 1. Comparing these keys, (3, "W") sorts before (2, "Y"), meaning that document 1 sorts first.

PRE-EXISTING BEHAVIOR FOR FIND
Prior to this change, the find command's behavior was similar to the above, except the smallest in-bounds element was selected as the sort key. A key was considered in-bounds based on the query predicate. Consider this example:

// Query.
db.coll.find({a: {$gte: 0}}).sort({a: 1});
 
// Document.
{_id: 0, a: [-3, -2, 2, 3]}

The sort key for this document would be 2, because it is the smallest key that falls within the bounds of the query predicate.

PRE-EXISTING BEHAVIOR FOR AGGREGATE
For aggregation sorts prior to this change, the entire array was used as the sort key. The array sort keys were compared element-wise to determine the sort order of the result set. For instance, when sorting these documents in ascending order by field a:

{_id: 0, a: [3, 1, 5]}
{_id: 1, a: [3, 4, 0]}

The sort keys would be [3, 1, 5] and [3, 4, 0] respectively. Document 0 sorts first: although the first array elements are equal, the second array element breaks the tie.

Original description

When the query sort algorithm is given an array value to sort on, an array element is chosen as the sort key by using the query predicate to generate sort bounds (see this comment in sort.cpp for an explanation about why this is the case). Instead, we should consider changing the semantics of sorting on an array value, such that the entire array is used as the sort key.

Pros:

  • Sort key generation becomes much less expensive.
  • Query sort order will become more consistent with aggregation sort order.
  • Multiple outstanding bugs in the server related to sorting on an array value could be closed as "Gone away" (e.g. SERVER-8152, SERVER-11878, SERVER-19394, SERVER-19397).
  • Implementation of query sort algorithm could be vastly simplified.
  • It's unclear whether the current array sorting semantics are well-defined (in particular, it's unclear whether a complete solution exists to SERVER-19397).

Cons:

  • Backwards-breaking.
  • Multi-key indexes would have to be considered ineligible for providing a sort.


 Comments   
Comment by Githook User [ 21/Jul/17 ]

Author:

{u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}

Message: SERVER-19402 Change agg array sort semantics to match find.
Branch: master
https://github.com/mongodb/mongo/commit/0b27b9e6c1196c53ec6bf68793ad7b57cb33cd30

Comment by Githook User [ 20/Jun/17 ]

Author:

{u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}

Message: SERVER-19402 Change find command semantics for sorting on an array field.

This eliminates the behavior in which the lowest-valued
in-bounds index key was chosen as the sort key. After this
change, we instead choose the lowest key overall, which may
or may not be in-bounds. This change prevents the sort order
from depending on either the query predicate or the
implementation details of the query planner.

Note that it is no longer correct for a multikey index to
provide a sort over an array field. However, a non-array
field of a multikey index can provide a sort when that index
has path-level multikeyness metadata.
Branch: master
https://github.com/mongodb/mongo/commit/50623596fb62da49a2b1495d5e0cd852faf91f9f

Generated at Thu Feb 08 03:50:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.