[SERVER-85337] $bucketAuto $first does not obey input order Created: 17/Jan/24  Updated: 02/Feb/24

Status: In Code Review
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: David Percy Assignee: Chi-I Huang
Resolution: Unresolved Votes: 0
Labels: query-director-triage
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-81571 Reconsider stable sort in sorter.cpp Blocked
Related
related to SERVER-85424 $group relies on stable sorting Closed
related to SERVER-81571 Reconsider stable sort in sorter.cpp Blocked
Assigned Teams:
Query Optimization
Operating System: ALL
Sprint: QO 2024-02-05, QO 2024-02-19
Participants:

 Description   

If I sort by {a: -1} and use $first, I should get the largest 'a' value within each group.

For example, using $bucket I get the expected result:

> db.c.find()
{ "_id" : 0, "a" : 0 }
{ "_id" : 1, "a" : 1 }
{ "_id" : 2, "a" : 2 }
{ "_id" : 3, "a" : 3 }
{ "_id" : 4, "a" : 4 }
{ "_id" : 5, "a" : 5 }
{ "_id" : 6, "a" : 6 }
{ "_id" : 7, "a" : 7 }
{ "_id" : 8, "a" : 8 }
{ "_id" : 9, "a" : 9 }
 
> db.c.aggregate([ {$sort: {a: -1}}, {$bucket: {groupBy: "$a", boundaries: [0, 5, 10], output: {a: {$first: "$a"}} }} ])
{ "_id" : 0, "a" : 4 }
{ "_id" : 5, "a" : 9 }

While the same thing with $bucketAuto gives the smallest value within each group:

> db.c.aggregate([ {$sort: {a: -1}}, {$bucketAuto: {groupBy: "$a", buckets: 2, output: {a: {$first: "$a"}} }} ])
{ "_id" : { "min" : 0, "max" : 5 }, "a" : 0 }
{ "_id" : { "min" : 5, "max" : 9 }, "a" : 5 }

It looks like internally $bucketAuto is doing something like:

  • stable-sort by the group key
  • scan the sorted input
    • accumulate each document into the current bucket
    • start a new bucket every N documents (based on input size / num buckets)

But this doesn’t work for order-sensitive accumulators like $first, because the sort loses some information about the original order--even using a stable sort. The documents within each dynamically-chosen bucket are sorted by (group key, input position), instead of by input position.



 Comments   
Comment by David Percy [ 22/Jan/24 ]

I think the solution would look like:

  • As $bucketAuto reads input it should pair each document with a counter, so we can tell later which document came first.
  • When we accumulate values into a bucket, any {$first: "$x"} accumulator should behave more like {$topN: {output: "$x", sortBy: {counter: 1}

    }}.

It would not be enough to sort by (group key, counter) because when multiple group keys land in the same bucket, we want $first to choose whichever document came first in the input (has the smallest counter value), regardless of group key.

Comment by David Percy [ 17/Jan/24 ]

This came up because brad.cater@mongodb.com was wondering where we rely on stable sorting, as part of SERVER-81571.

I think the stable sorting is not helpful for $bucketAuto. Within each bucket it needs to either:

  • use an order insensitive operator, like $max
    • stable sorting is not necessary
  • use an order-sensitive operator, like $first
    • stable sorting is not sufficient
Generated at Thu Feb 08 06:57:29 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.