[SERVER-25038] Comparisons made by $bucket at parse time must respect the collation Created: 13/Jul/16  Updated: 19/Nov/16  Resolved: 04/Oct/16

Status: Closed
Project: Core Server
Component/s: Aggregation Framework, Querying
Affects Version/s: None
Fix Version/s: 3.4.0-rc1

Type: Bug Priority: Critical - P2
Reporter: David Storch Assignee: David Storch
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Sprint: Query 18 (08/05/16), Query 2016-08-29, Query 2016-09-19, Query 2016-10-10
Participants:
Linked BF Score: 0

 Description   

The difficulty is that parsing of the aggregation pipeline is done before lock acquisition, but the final collation is not always known until after the collection lock has been acquired (since we may need to read the collection's metadata describing the collection default collation). max.hirschhorn and myself brainstormed a few potential solutions:

  1. Parse the aggregation pipeline under the lock so we know what the collation is.
  2. Remove the comparisons and let users create buckets that will always be empty.
  3. Configure the $group stage to do the checks that require collation by making it aware that it's a $group for bucketing.
  4. Create a DocumentSourceBucket class that during its optimize() method does these checks, adds a DocumentSourceGroup and DocumentSourceSort to the pipeline, and then removes itself from the pipeline.


 Comments   
Comment by Githook User [ 04/Oct/16 ]

Author:

{u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}

Message: SERVER-25038 make LiteParsedDocumentSource destructor virtual
Branch: master
https://github.com/mongodb/mongo/commit/5396d9b72eea0d0d0086150eccf8c3b8ef710064

Comment by Githook User [ 03/Oct/16 ]

Author:

{u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}

Message: SERVER-25038 add LiteParsedPipeline

This provides a way to do pre-parse validity checks. Full
parsing of the Pipeline must be done under the collection
lock, when the collation is known.
Branch: master
https://github.com/mongodb/mongo/commit/8758a5c7be68effbc4a1f857787bfb7b201c6389

Comment by Charlie Swanson [ 10/Aug/16 ]

Code review url: https://mongodbcr.appspot.com/86280001/

Comment by Max Hirschhorn [ 21/Jul/16 ]

david.storch, I discussed this ticket a bit more with Charlie and he suggested adding a non-documented, internal option to ExpressionSwitch that verifies its case statements are sorted and returns an error if they aren't. This proposal is more akin to #3 in the list in the ticket's description, but at the level of the expression and not the stage. The argument against doing #4 is that it is currently possible to run a non-optimized pipeline and that would no longer be the case if a DocumentSourceBucket instance was added to the pipeline - there'd be no reason to give DocumentSourceBucket::getNext() a real implementation because it is expected to replace itself with a DocumentSourceGroup and a DocumentSourceSort.

There is a separate consideration that we should give for work on this ticket related to when expressions specified in $bucket are optimized. In the current implementation in the specification in SERVER-23815, the boundaries are allowed to be expressions that resolve to constants (when optimized()'d) and aren't just constants themselves.

'boundaries' must be constant values (can't use "$x", but can use {$add: [4, 5]}), and must be sorted.

While it is currently the case that we don't resolve the $cond expression to an ExpressionConstant unless all three of its operands can be resolved to constants, we may in the future want to optimize cases like the following to simply be the "then" or "else" expression.

{$cond: [{$eq: ["foo", "FOO"]}, "$x", 0]}

However, we would need to know the correct collation to compare "foo" and "FOO" under in order to do so. asya, are we willing to change the specification for $bucket to only accept constants and not expressions that resolve to constants in order to simplify some of the integration with collation in $bucket? Otherwise if we go with Charlie's proposal, we'd want to defer those checks to $switch as well.

Generated at Thu Feb 08 04:08:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.