[SERVER-32141] Support for non-ASCII characters in $toLower Created: 20/Nov/17  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor - P4
Reporter: Kaitlin Mahar Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Query Optimization
Participants:

 Description   

From the docs: "$toLower only has a well-defined behavior for strings of ASCII characters."

As a result, BI Connector (BIC) pushdown of the lcase (lowercase) function gives incorrect results for some non-ASCII characters, for example ƏŨÓ€. The lowercase form would be əũó€ (€ has no case, so it stays as it is), but

{ $toLower : "ƏŨÓ€"}

leaves the characters unchanged.

It would be great if additional characters like these were supported.
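
A minimal sketch of the behavior described above, in Python rather than server code. The helper ascii_only_lower is hypothetical and only assumes that the conversion touches the bytes A-Z of the UTF-8 encoding; it is not taken from the server source. Python's built-in str.lower() is shown next to it as a Unicode-aware comparison.

```python
def ascii_only_lower(text: str) -> str:
    """Hypothetical ASCII-only lowercase: shift bytes A-Z, leave every other byte alone."""
    data = text.encode("utf-8")
    # Every byte of a multi-byte UTF-8 sequence is >= 0x80, so it never matches A-Z.
    lowered = bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in data)
    return lowered.decode("utf-8")


# The sample from the description, written as escapes to pin the exact code points.
sample = "\u018F\u0168\u00D3\u20AC"  # ƏŨÓ€

print(ascii_only_lower(sample))               # non-ASCII characters come back unchanged
print(ascii_only_lower("MongoDB " + sample))  # only the ASCII letters are lowered
print(sample.lower())                         # Unicode-aware lowering, for comparison
```

Under that assumption the ASCII-only version returns the sample unchanged, which matches what the ticket reports for $toLower, while the Unicode-aware call produces the expected lowercase string.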



 Comments   
Comment by Kyle Suarez [ 01/Dec/17 ]

It looks like we use boost::to_lower() to perform the case conversion. I wonder whether it supports all Unicode characters and not just ASCII? That wasn't clear to me from reading the boost/algorithm/string/detail/case_conv.hpp header; it looks like it calls std::tolower(), which appears to be locale-dependent.

Case conversion is well-defined in the Unicode standard, so maybe we could try to use the ICU library to perform the case conversion?
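
To illustrate what Unicode-defined case conversion involves, here is a small sketch that uses Python's built-in str.lower() as a stand-in. It is not ICU and not the server's code; the helper name lower_with_sizes is made up for this example. It shows one thing an implementation would likely need to account for: the lowercased string can have a different UTF-8 byte length than the input.

```python
def lower_with_sizes(text: str) -> tuple[str, int, int]:
    """Return the Unicode-lowercased text plus UTF-8 byte lengths before and after."""
    lowered = text.lower()  # Python's Unicode case mapping, used here instead of ICU
    return lowered, len(text.encode("utf-8")), len(lowered.encode("utf-8"))


samples = {
    "reported sample": "\u018F\u0168\u00D3\u20AC",  # ƏŨÓ€ from the description
    "dotted capital I": "\u0130",                   # İ, lowercases to i + combining dot above
}

for name, text in samples.items():
    lowered, before, after = lower_with_sizes(text)
    print(f"{name}: {text!r} -> {lowered!r} ({before} bytes -> {after} bytes)")
```

The reported sample keeps its byte length, but the dotted capital I grows from two bytes to three, so an in-place, byte-for-byte conversion would not be enough for full Unicode case mapping.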

Comment by Adinoyi Omuya [ 20/Nov/17 ]

From https://docs.mongodb.com/manual/reference/operator/aggregation/toLower/#behavior:

$toLower only has a well-defined behavior for strings of ASCII characters.

I'm not sure how this interacts with locales, but it is probably best to update the description and move this to the SERVER project as a feature request.

Generated at Thu Feb 08 04:29:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.