- Type: Task
- Resolution: Won't Do
- Priority: Major - P3
- Affects Version/s: None
- Labels: None
0.4
Currently, the documentation for text indexes has this to say about diacritic insensitivity:
"With version 3, text index is diacritic insensitive. That is, the index does not distinguish between characters that contain diacritical marks and their non-marked counterpart, such as é, ê, and e. More specifically, the text index strips the characters categorized as diacritics in Unicode 8.0 Character Database Prop List."
This is mostly correct, but our algorithm misses some cases because of oddities in the Unicode standard. The algorithm takes a Unicode code point (say Ç), applies a standard decomposition algorithm to break it down (in this case, C + ◌̧), and then removes the diacritic characters (here, the combining cedilla). However, some letters, such as Ł and Ø, decompose only to themselves under the standard decomposition algorithm, so they are not transformed into L and O, respectively. See "Non-decomposition of Certain Diacritics" in Section 2.12 of the current Unicode standard.
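To illustrate the behavior described above, here is a minimal Python sketch. It is not the server's implementation: it assumes canonical decomposition (NFD) from the standard `unicodedata` module and treats every character with a nonzero combining class as a diacritic, which only approximates the Unicode diacritic categorization the documentation refers to. The function name `strip_diacritics` is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose each character, then drop combining marks.

    Assumption: NFD plus a combining-class check stands in for the
    decomposition and diacritic-removal steps described in this ticket.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters and a nonzero
    # canonical combining class for marks such as the combining cedilla.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Ç has a canonical decomposition (C + combining cedilla), so it is stripped.
print(unicodedata.decomposition("Ç"))  # 0043 0327
print(strip_diacritics("Ç"))           # C

# Ł and Ø have no canonical decomposition, so they pass through unchanged.
print(repr(unicodedata.decomposition("Ł")))  # ''
print(strip_diacritics("Ł"), strip_diacritics("Ø"))  # Ł Ø
```

Under these assumptions, é and ê both reduce to e, while Ł and Ø are left as they are, which matches the gap reported in SERVER-31152.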
I think it would be worth clarifying in the documentation how this diacritic stripping is performed, to avoid confusion; see SERVER-31152.
- is related to: SERVER-31152 Wrong diacritics for polish letter ł (Closed)