[SERVER-8428] Text search tokens need to be unicode normalized
Created: 31/Jan/13 | Updated: 06/Dec/22 | Resolved: 16/Aug/19
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Text Search |
| Affects Version/s: | 2.3.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | J Rassi | Assignee: | Backlog - Query Team (Inactive) |
| Resolution: | Done | Votes: | 2 |
| Labels: | query-44-grooming |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Assigned Teams: | Query |
| Operating System: | ALL |
| Participants: |
| Description |
|
E.g., insert "café" twice (once with a combining accent, once without):
But a search for one does not return the other:
|
| Comments |
| Comment by David Storch [ 16/Aug/19 ] |
|
This issue does not exist for version 3 text indexes:
Text index v3 is diacritic and case insensitive, and correctly handles Unicode normalization as shown above. Older text index versions still suffer from this problem:
We do not plan to improve this for older text index versions, so I'm closing this issue as "Done" as part of introducing v3 text indexes.
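The diacritic- and case-insensitive matching mentioned in this comment can be illustrated with a rough sketch. This is not MongoDB's v3 implementation: `fold_token` is a hypothetical helper, and stripping combining marks after NFD decomposition is only one simple way to approximate that behaviour, shown here with Python's standard `unicodedata` module.

```python
import unicodedata


def fold_token(token: str) -> str:
    """Hypothetical helper: approximate diacritic- and case-insensitive folding."""
    # Decompose so that accents become separate combining code points.
    decomposed = unicodedata.normalize("NFD", token)
    # Drop combining marks (characters with a nonzero canonical combining class).
    stripped = "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)
    # Case-fold, then recompose whatever remains.
    return unicodedata.normalize("NFC", stripped.casefold())


# Both encodings of "café" and the unaccented "CAFE" fold to the same token.
tokens = ["caf\u00e9", "cafe\u0301", "CAFE"]
folded = {fold_token(t) for t in tokens}
print(folded)
```

Under these assumptions, all three inputs collapse to a single folded token, which is stronger than plain Unicode normalization because the accent and the case are both discarded.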
| Comment by Will Shaver [ 10/Apr/14 ] |
|
Sure: https://jira.mongodb.org/browse/SERVER-13535. I love creating bugs, it seems.
| Comment by J Rassi [ 09/Apr/14 ] |
Unicode normalization does not strip diacritic marks from Unicode strings (though feel free to file a separate ticket for that feature request!). The example in the description illustrates that the Unicode string "café" (where é is encoded as "LATIN SMALL LETTER E WITH ACUTE") should be considered equivalent to the Unicode string "café" (where é is encoded as "LATIN SMALL LETTER E" + "COMBINING ACUTE ACCENT"). See http://unicode.org/reports/tr15/ for a description of Unicode normalization.
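As a minimal sketch of the equivalence described here (not MongoDB's tokenizer; Python's standard `unicodedata` module is used purely for illustration), the two encodings of "café" compare unequal as raw code point sequences but equal once both are normalized to the same form, and the accent itself is preserved:

```python
import unicodedata

precomposed = "caf\u00e9"   # é as U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

# The raw strings differ, so a code-point-wise token comparison fails.
raw_equal = precomposed == decomposed

# After normalizing both to the same form (NFC here), they compare equal.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

# Normalization does not strip the accent: the result still differs from "cafe".
accent_kept = unicodedata.normalize("NFC", decomposed) != "cafe"

print(raw_equal, nfc_equal, accent_kept)  # prints: False True True
```

NFC is chosen arbitrarily in this sketch; normalizing both sides to NFD would make them compare equal as well.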
| Comment by Will Shaver [ 09/Apr/14 ] |
|
Adding a couple keywords so this issue is easier to find: diacritical, diacritics, accents, accentuated. I had this issue come up with Pokémon vs Pokemon. |