[SERVER-79636] equivalent() function for $expr is not collation-aware Created: 02/Aug/23  Updated: 01/Feb/24

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: David Storch
Assignee: Backlog - Query Optimization
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-79018 Implement MatchExpression hasher Closed
Assigned Teams:
Query Optimization
Participants:

 Description   

The MatchExpression interface offers MatchExpression::equivalent() which can be used to check whether two match expressions are the same. Consider the following two $expr match expressions:

// Display the data in the collection.
MongoDB Enterprise > db.c.find()
{ "_id" : ObjectId("64caa40c416866f24e97cc48"), "str" : "a" }
{ "_id" : ObjectId("64caa40e416866f24e97cc4a"), "str" : "A" }
{ "_id" : ObjectId("64caa410416866f24e97cc4c"), "str" : "b" }
 
// Query using lowercase constant.
MongoDB Enterprise > db.c.find({$expr: {$eq: ["$str", "a"]}}).collation({locale: "en_US", strength: 2})
{ "_id" : ObjectId("64caa40c416866f24e97cc48"), "str" : "a" }
{ "_id" : ObjectId("64caa40e416866f24e97cc4a"), "str" : "A" }
 
// Query using uppercase constant.
MongoDB Enterprise > db.c.find({$expr: {$eq: ["$str", "A"]}}).collation({locale: "en_US", strength: 2})
{ "_id" : ObjectId("64caa40c416866f24e97cc48"), "str" : "a" }
{ "_id" : ObjectId("64caa40e416866f24e97cc4a"), "str" : "A" }

These two queries use the same case-insensitive collation and are therefore identical in meaning. However, the implementation of ExprMatchExpression::equivalent() is not collation-aware. Because the related ticket SERVER-30982 has not been implemented yet, ExprMatchExpression::equivalent() currently works by serializing both the left-hand side and the right-hand side to a mongo::Value representation and then comparing the resulting values with the simple collator. Under the simple collator "a" and "A" are distinct, so these two expressions are erroneously considered non-equivalent.
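The difference between the two comparison modes can be sketched outside the server. The following is a minimal illustration in plain JavaScript, not MongoDB server code: simpleEquivalent() and collationAwareEquivalent() are hypothetical helpers, JSON.stringify stands in for serialization to mongo::Value, and Intl.Collator with sensitivity "accent" is assumed to approximate {locale: "en_US", strength: 2}.

```javascript
// Hypothetical sketch; these helpers do not exist in the MongoDB code base.

// Mimics the current behavior: serialize both sides and compare exactly,
// which is what a simple (binary) collator amounts to for string constants.
function simpleEquivalent(lhs, rhs) {
  return JSON.stringify(lhs) === JSON.stringify(rhs);
}

// Walks both expression trees and compares string constants with a collator.
function collationAwareEquivalent(lhs, rhs, collator) {
  if (typeof lhs === "string" && typeof rhs === "string") {
    // Strings starting with "$" are treated as field paths: they name
    // fields, so they must match exactly regardless of collation.
    if (lhs.startsWith("$") || rhs.startsWith("$")) return lhs === rhs;
    return collator.compare(lhs, rhs) === 0;
  }
  if (Array.isArray(lhs) || Array.isArray(rhs)) {
    return Array.isArray(lhs) && Array.isArray(rhs) &&
      lhs.length === rhs.length &&
      lhs.every((v, i) => collationAwareEquivalent(v, rhs[i], collator));
  }
  if (lhs !== null && rhs !== null &&
      typeof lhs === "object" && typeof rhs === "object") {
    const lk = Object.keys(lhs);
    const rk = Object.keys(rhs);
    // Operator names ($expr, $eq, ...) must match exactly and in order.
    return lk.length === rk.length &&
      lk.every((k, i) => k === rk[i] &&
        collationAwareEquivalent(lhs[k], rhs[k], collator));
  }
  return lhs === rhs;
}

// Assumption: sensitivity "accent" ignores case but not accents or base
// letters, which is roughly ICU strength 2.
const collator = new Intl.Collator("en-US", { sensitivity: "accent" });
const lower = { $expr: { $eq: ["$str", "a"] } };
const upper = { $expr: { $eq: ["$str", "A"] } };

console.log(simpleEquivalent(lower, upper));                   // false
console.log(collationAwareEquivalent(lower, upper, collator)); // true
```

With the exact comparison the two predicates differ, while the collator-based walk treats them as the same predicate, which matches the query results shown above.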

This issue should not result in a user-facing bug, because the comparison currently in use (the simple collator) is stricter than the query's collation: it can report equivalent expressions as non-equivalent, but not the reverse. However, queries may miss out on some optimizations because of this stricter comparison. The same applies to the hashing function used by the Boolean simplification from SERVER-79018. For the scope of this ticket, the implementation of ExprMatchExpression::equivalent() should take the collation into account when comparing expressions. This will affect long-tail customers.
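The hashing concern follows from the usual contract that expressions considered equivalent must also hash to the same value. A minimal sketch of that contract in plain JavaScript, under loud assumptions: hashExpr(), normalize(), and comparisonKey() are hypothetical names, lowercasing is only a stand-in for a real collation comparison key (such as an ICU sort key), and FNV-1a is an arbitrary hash chosen for illustration. None of this reflects the code in pipeline/expression_hasher.h.

```javascript
// Hypothetical sketch; not the hasher from SERVER-79018.

// Stand-in for a collation comparison key. A real implementation would
// derive the key from the collator rather than lowercase the string.
function comparisonKey(str) {
  return str.toLowerCase();
}

// Replaces every string constant with keyFn(constant). Strings starting
// with "$" are treated as field paths and left untouched.
function normalize(node, keyFn) {
  if (typeof node === "string") {
    return node.startsWith("$") ? node : keyFn(node);
  }
  if (Array.isArray(node)) {
    return node.map((v) => normalize(v, keyFn));
  }
  if (node !== null && typeof node === "object") {
    return Object.fromEntries(
      Object.entries(node).map(([k, v]) => [k, normalize(v, keyFn)]));
  }
  return node;
}

// 32-bit FNV-1a over the serialized, normalized expression.
function hashExpr(expr, keyFn) {
  const text = JSON.stringify(normalize(expr, keyFn));
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

const identity = (s) => s; // models hashing under the simple collator
const lower = { $expr: { $eq: ["$str", "a"] } };
const upper = { $expr: { $eq: ["$str", "A"] } };

console.log(hashExpr(lower, identity) === hashExpr(upper, identity));           // false
console.log(hashExpr(lower, comparisonKey) === hashExpr(upper, comparisonKey)); // true
```

Hashing the comparison key instead of the raw constant keeps the hash consistent with a collation-aware equivalence check, so expressions that compare equal land in the same bucket.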



 Comments   
Comment by David Storch [ 10/Oct/23 ]

Do we have a route forward for how to fix the problem observed here? It's not clear to me that it's worth spending too much time on this unless we identify that it actually can cause a user-facing bug.

Comment by Alexander Ignatyev [ 17/Aug/23 ]

SERVER-79018 implemented a hasher for expressions in pipeline/expression_hasher.h, so if the equivalent() function is fixed, the hasher should be changed accordingly as well.

Generated at Thu Feb 08 06:41:30 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.