[SERVER-31152] Wrong diacritics for polish letter ł Created: 19/Sep/17 Updated: 27/Oct/23 Resolved: 07/Nov/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.4.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Mateo | Assignee: | Kyle Suarez |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Operating System: | ALL |
| Steps To Reproduce: | |
| Sprint: | Query 2017-10-23, Query 2017-11-13 |
| Participants: | |
| Description |
|
According to docs https://docs.mongodb.com/manual/core/index-text/ text index |
| Comments |
| Comment by Wojciech Jakubas [ 08/Nov/21 ] |
|
Four years later, text search still does not work with the Polish letter ł/Ł. The workarounds mentioned above are inconsistent with how other Polish letters such as ź, ż, ó, ć, and ą behave: those work fine, and only ł/Ł does not. |
| Comment by Mateo [ 08/Nov/17 ] |
|
I was using 3.4.3 from the openSUSE 42.3 repository, but I have now upgraded to 3.4.10 from https://download.opensuse.org/repositories/server:/database/openSUSE_Leap_42.3/ and I always get the same result: 1 hit. Maybe I need to enable or disable something? I did a fresh install on a Windows machine:
------------------ |
| Comment by Kyle Suarez [ 07/Nov/17 ] |
|
What version of the server are you using? I tested both the current tip of master (b937ec566) and MongoDB 3.4.9, and both have identical results: all three documents are found.
I also tested this in the presence of a text index on the "name" field, and the query correctly ignores the index:
|
| Comment by Mateo [ 07/Nov/17 ] |
|
To summarize: how do I query for all 3 cities from your third answer?
This gives me one hit. Querying the text index with $language: 'en' also gives one hit. |
| Comment by Kyle Suarez [ 07/Nov/17 ] |
|
Hi vikingpl,
Looks like I chose my collations poorly in my first attempt. The Polish collation will treat L and Ł as distinct, as will the POSIX locale. In MongoDB, you can achieve the desired results by using a collation locale where "Ł" is not a standard character**, like English:
However, I will say that the Unicode collation algorithm is completely distinct from our diacritic-stripping algorithm.
Going back to your original request, after examining the behavior of text search, I would say that the current diacritic-stripping algorithm is correct with regard to the letter Ł. If the Unicode Consortium is not going to specify a decomposition for Ł, I don't think that MongoDB should attempt to make one up, even if it "seems like" it should.
Given all this, I'm going to close this ticket as Works as Designed. Collation provides an alternative to text search for this particular peculiarity.
Best,
**I found which locales would work by running this script and seeing which collations found all four results: https://gist.github.com/ksuarz/2e801814459ee2a013738cdf8c5ae9ef |
| Comment by Mateo [ 03/Nov/17 ] |
|
Try
By default in MySQL/MariaDB utf8_unicode_ci is based on Unicode-4.x, so it inherits all Unicode-4.x features. |
| Comment by Kyle Suarez [ 03/Nov/17 ] |
|
Hi vikingpl,
Which collation are you using? I haven't been able to perform a search that treats "L" and "Ł" the same in both MongoDB and MariaDB. We can't use the Polish locale, since that locale will definitely treat the two letters as distinct no matter what the strength. In MongoDB, I tried en_US_POSIX:
In MariaDB, I tried both utf8_general_ci and utf8_unicode_ci:
It's worth noting that MongoDB only uses the official Unicode collation algorithm. There are other custom collation algorithms supported by Oracle MySQL, but those are totally non-standard. |
| Comment by Mateo [ 03/Nov/17 ] |
|
Thanks for the clarification.
and with the proper collation set, searching for Ł is no problem, because its weight is the same as that of the letter L. |
| Comment by Kyle Suarez [ 02/Nov/17 ] |
|
Given my findings above, I'm keeping this ticket in the "Needs Triage" queue so that it can be reconsidered by the Query Team as a whole during the next planning meeting. |
| Comment by Kyle Suarez [ 02/Nov/17 ] |
|
Hi vikingpl,
Sorry for the delay in getting to this issue. To strip diacritics, MongoDB applies some transformations specified in the Unicode standard. To be specific, we decompose characters using NFD, remove characters that are classified as diacritics, and then re-compose the result with NFC. For example, this transformation breaks down Ç into C + ◌̧, removes the combining cedilla character and results in C. However, there are certain characters, like Ł (LATIN CAPITAL LETTER L WITH STROKE), for which there is no simpler Unicode decomposition even if you might expect it. MongoDB uses Unicode version 8.0, but this excerpt from Unicode 10.0 is still informative:
I can understand if you'd like a feature request where we expand our diacritic-stripping algorithm to also handle characters like Ł. However, this will probably involve a text index version bump and we presently don't have plans to take it upon ourselves to classify what is a diacritic and what is not beyond the current Unicode standard. I've opened
Regards, |
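The decompose, strip, and re-compose steps described in the comment above can be sketched with Python's standard `unicodedata` module. This is only an illustration, not MongoDB's implementation: the function name `strip_diacritics` is hypothetical, and treating every character in Unicode category `Mn` (nonspacing mark) as a diacritic is an assumption that may not match the server's classification exactly. The Unicode data version also depends on the Python build rather than being 8.0.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: NFD-decompose, drop combining marks, NFC-recompose."""
    # Step 1: canonical decomposition, e.g. "Ç" becomes "C" + U+0327.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop nonspacing marks (assumed here to stand in for "diacritics").
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Step 3: re-compose whatever is left.
    return unicodedata.normalize("NFC", kept)


# "Ç" has a canonical decomposition, so the cedilla can be removed.
print(unicodedata.decomposition("\u00c7"))  # 0043 0327
print(strip_diacritics("\u00c7"))           # C

# "Ł" (U+0141) has no decomposition, so it passes through unchanged,
# while the other accented letters in the same word are stripped.
print(repr(unicodedata.decomposition("\u0141")))  # ''
print(strip_diacritics("Łódź"))                   # Łodz
```

Under these assumptions, a term such as "Łódź" is reduced to "Łodz", not "Lodz", which is consistent with the behavior reported in this ticket.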
| Comment by Mark Agarunov [ 19/Sep/17 ] |
|
Hello vikingpl,
Thank you for the report. I've managed to reproduce this behavior using the detailed steps provided. I've set this ticket to Needs Triage so that it can be scheduled against our currently planned work. Please watch this ticket for updates on this issue.
Thanks, |