[SERVER-32141] Support for non-ASCII characters in $toLower Created: 20/Nov/17 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor - P4 |
| Reporter: | Kaitlin Mahar | Assignee: | Backlog - Query Optimization |
| Resolution: | Unresolved | Votes: | 3 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Assigned Teams: | Query Optimization |
| Participants: |
| Description |
|
From the docs: "$toLower only has a well-defined behavior for strings of ASCII characters." As a result, BIC pushdown of the lcase (lowercase) function gives incorrect results for some characters, for example ƏŨÓ€. The lowercase versions would be əũó€, but $toLower leaves the characters unchanged. It would be great if additional characters like these were supported. |
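To illustrate the gap described above, here is a minimal Python sketch contrasting an ASCII-only lowercase with a Unicode-aware one. The helper `ascii_to_lower` is hypothetical and only emulates the documented ASCII-only behavior; it is not the server's implementation. Python's built-in `str.lower()` stands in for the Unicode-aware conversion being requested.

```python
def ascii_to_lower(s: str) -> str:
    """Hypothetical helper: lowercase only A-Z, leave every other code point untouched."""
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in s
    )


sample = "ƏŨÓ€"

# ASCII-only conversion: no character in the sample is in A-Z, so nothing changes.
print(ascii_to_lower(sample))

# Unicode-aware conversion: Ə, Ũ and Ó are mapped to lowercase; € has no case and stays as-is.
print(sample.lower())
```

The first call leaves the sample as it is, which matches the behavior reported in this ticket, while the second produces the lowercase string the reporter expected.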
| Comments |
| Comment by Kyle Suarez [ 01/Dec/17 ] |
|
Looks like we use boost::to_lower() to perform the case conversion. I wonder if it supports all Unicode characters and not just ASCII? It wasn't clear to me from reading the boost/algorithm/string/detail/case_conv.hpp header. It looks like it calls std::tolower(), which appears to be locale-dependent. Case conversion is well-defined in the Unicode standard, so maybe we could try to use the ICU library to perform the case conversion? |
| Comment by Adinoyi Omuya [ 20/Nov/17 ] |
|
From https://docs.mongodb.com/manual/reference/operator/aggregation/toLower/#behavior: "$toLower only has a well-defined behavior for strings of ASCII characters." I'm not sure how this interacts with locales, but it is probably best to update the description and move this to the SERVER project as a feature request. |