[SERVER-8423] Text search case folding needs utf-8 support Created: 31/Jan/13 Updated: 05/Dec/16 Resolved: 11/Aug/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Text Search |
| Affects Version/s: | 2.3.2 |
| Fix Version/s: | 3.1.7 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | J Rassi | Assignee: | Adam Chelminski (Inactive) |
| Resolution: | Done | Votes: | 18 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Backwards Compatibility: | Major Change |
| Sprint: | Platform 6 07/17/15, Platform 8 08/28/15, Platform 7 08/10/15 |
| Participants: |
| Description |
|
For example, in Russian queries, "Как" currently lowercases to itself, whereas it should lowercase to "как". Correct case folding is needed for stopword removal, matching, etc.
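
A minimal sketch of the symptom, assuming the current folding only touches ASCII letters (as the manual quote in the comments suggests). `ascii_lower` is a hypothetical stand-in, not the server's code; Python's built-in `str.casefold` is used only to show the expected Unicode-aware result.

```python
def ascii_lower(text: str) -> str:
    """Lowercase only A-Z; every other code point is left untouched.

    Hypothetical stand-in for ASCII-only folding, not MongoDB's implementation.
    """
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in text
    )


word = "Как"

# ASCII-only folding leaves the Cyrillic capital untouched.
ascii_result = ascii_lower(word)

# Unicode-aware folding maps the capital to its lowercase form.
unicode_result = word.casefold()

print(ascii_result == word)
print(unicode_result == "как")
```

With ASCII-only folding, a query for "Как" would not match an indexed "как", and a lowercase stopword list would not catch the capitalized form.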
|
| Comments |
| Comment by Adam Chelminski (Inactive) [ 11/Aug/15 ] |
|
This fix is integrated into text index v3, which will not work at all with previous versions of MongoDB. |
| Comment by Agrumas [X] [ 18/Jul/15 ] |
|
I am very excited about upcoming fulltext-search improvements, thanks @adam.chelminski |
| Comment by Pavel Chertorogov [ 29/Jun/15 ] |
|
Up! This problem with UTF-8 collation has been open for 2 years.
| Comment by Andrey Yurchenkov [ 07/May/15 ] |
|
Russian language, version 3.0.1.
| Comment by Matt Kangas [ 03/Feb/15 ] |
|
|
| Comment by Alexander Black [X] [ 08/Dec/14 ] |
|
Same problem in version 2.6.5.
| Comment by Nikita Dedik [ 26/Nov/14 ] |
|
Same problem here, Russian language. Manual says: "If the index language is English, text indexes are case-insensitive for non-diacritics; i.e. case insensitive for [A-z]." - why is it so, in the end? |
| Comment by Roman [ 06/Feb/14 ] |
|
This bug is very critical for Russian users of MongoDB!
| Comment by Valentin Kostadinov [ 13/Jan/14 ] |
|
Looking forward to having this bug fixed. Currently, I'm creating a "normalized" field (simply doing the lower case folding myself) and then creating the text index on that field. It's a bit of a pain point. Should be really easy to fix at least for the major languages. Cyrillic would be a big step forward and should be very easy to fix. |
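
The workaround described above might look roughly like the following sketch. The field names `body` and `body_folded` and both helper functions are hypothetical, chosen only for illustration. The folding is done client-side with Python's `str.casefold`, and the text index would then be created on the folded field rather than on the original.

```python
def with_folded_field(doc: dict, source: str = "body",
                      target: str = "body_folded") -> dict:
    """Return a copy of doc with a case-folded copy of doc[source] in doc[target].

    Field names are illustrative assumptions, not part of any MongoDB API.
    """
    out = dict(doc)
    out[target] = doc[source].casefold()
    return out


def fold_query(query: str) -> str:
    """Fold the search string the same way as the stored field."""
    return query.casefold()


doc = with_folded_field({"body": "Как дела"})
query = fold_query("КАК")

# The folded query term now appears among the folded document's words.
print(query in doc["body_folded"].split())
```

The original field stays untouched for display. The cost is the extra stored field and the need to fold every query before sending it.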
| Comment by Cyrill [ 10/Dec/13 ] |
|
Any news about this bug?
| Comment by J Rassi [ 31/Jan/13 ] |
|
Case folding table available at <http://www.unicode.org/Public/UNIDATA/CaseFolding.txt>, regularly updated. Could process this similarly to processing of stopword lists (and then encoding as UTF-8). |
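
As a rough sketch of that processing step, the following parses lines in the `CaseFolding.txt` format (`<code>; <status>; <mapping>; # <name>`) into a lookup table and applies it. The embedded sample is a hand-picked handful of entries rather than the full file. The function names are hypothetical, and the sketch assumes simple folding (statuses `C` and `S`) is wanted.

```python
# A few entries in the CaseFolding.txt format; the real file is much larger.
SAMPLE = """\
# <code>; <status>; <mapping>; # <name>
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
041A; C; 043A; # CYRILLIC CAPITAL LETTER KA
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def parse_case_folding(text: str, statuses=("C", "S")) -> dict:
    """Build {character: folded string} from CaseFolding.txt-style lines.

    statuses=("C", "S") selects simple folding; ("C", "F") selects full folding.
    """
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        if status in statuses:
            folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
            table[chr(int(code, 16))] = folded
    return table


def fold(text: str, table: dict) -> str:
    """Replace each character that has a table entry; leave the rest as-is."""
    return "".join(table.get(ch, ch) for ch in text)


table = parse_case_folding(SAMPLE)
folded = fold("Как", table)
print(folded)
print(folded.encode("utf-8"))  # the UTF-8 bytes that would be stored or compared
```

Simple folding keeps every mapping one character to one character, which is easier to apply in place. Full folding (`F` entries) can expand a character such as "ß" into "ss".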