[SERVER-29918] stemming behavior for diacritics causes incorrect results Created: 29/Jun/17 Updated: 27/Oct/23 Resolved: 09/Nov/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Text Search |
| Affects Version/s: | 3.4.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | adrien petel | Assignee: | Kyle Suarez |
| Resolution: | Works as Designed | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Ubuntu 16.04, MongoDB 3.4.4 |
||
| Issue Links: |
|
||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||
| Steps To Reproduce: |
|
||||||||||||||||||||
| Sprint: | Query 2017-07-31, Query 2017-10-02, Query 2017-10-23, Query 2017-11-13 | ||||||||||||||||||||
| Participants: | |||||||||||||||||||||
| Description |
|
$text search is not diacritic-insensitive if the word contains a dieresis ( ¨ ). The dieresis is categorized as a diacritic in the Unicode 8.0 Character Database property list (see http://www.unicode.org/Public/8.0.0/ucd/PropList.txt). Search with collation works fine with
|
| Comments |
| Comment by Kyle Suarez [ 09/Nov/17 ] | ||||||||||||
|
I've taken another look at the issue here and thoroughly examined what happens with regard to stemming and diacritic stripping. As Dan mentions, the stemmer must be diacritic-sensitive because diacritics affect stemming, even in languages like English. For example, in English:
The current text search engine is written in a way that errs on the side of "correctness". For now, truly diacritic-insensitive queries should use either collation or a text index language of "none" (but will lose out on the benefit of stemming). | ||||||||||||
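As a rough illustration of the two workarounds mentioned above, the following sketch builds the corresponding database command documents as plain Python dictionaries. The collection name `products` and the field name `name` are hypothetical and do not come from this ticket. The collation example assumes a regular equality query (not `$text`) with an English locale at strength 1, which compares base characters only and so ignores both diacritics and case.

```python
# Workaround 1: a text index with default_language "none", which disables
# stemming (and stop words) so that only diacritic folding applies.
# "products" and "name" are hypothetical names used for illustration.
text_index_cmd = {
    "createIndexes": "products",
    "indexes": [
        {
            "key": {"name": "text"},
            "name": "name_text",
            "default_language": "none",
        }
    ],
}

# Workaround 2: an ordinary equality query with a collation. Strength 1
# compares base characters only, ignoring diacritics and case.
find_cmd = {
    "find": "products",
    "filter": {"name": "iphone"},
    "collation": {"locale": "en", "strength": 1},
}

# With a driver such as PyMongo, each document could be sent with
# db.command(text_index_cmd) or db.command(find_cmd); no connection is
# made here.
```

Neither option is free: the first gives up stemming entirely, and the second matches whole field values rather than individual words within the text.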
| Comment by Kyle Suarez [ 10/Aug/17 ] | ||||||||||||
|
It still seems like something is off here, though, like there is an inconsistent approach to the way we perform diacritic stripping and stemming. In any case, I've stopped investigating this ticket as the Query Team's priority is on 3.6 scheduled features. It does seem worth it, though, for someone to investigate this behavior further once we revisit the tickets on the backlog. | ||||||||||||
| Comment by Daniel Pasette (Inactive) [ 03/Aug/17 ] | ||||||||||||
|
The stemmer is diacritic-sensitive, and it must be, because accents have meaning in some languages. | ||||||||||||
| Comment by Kyle Suarez [ 25/Jul/17 ] | ||||||||||||
|
Good point... I tried
and got
I'm putting this ticket into Needs Triage, so that the query team can triage this ticket at the next planning meeting. Whoever picks up this ticket should look at the places where stemming happens and see if we can strip diacritics before it occurs. | ||||||||||||
| Comment by Kelsey Schubert [ 25/Jul/17 ] | ||||||||||||
|
I'd argue the problem isn't that we stem "iphone" to "iphon" (the stem doesn't have to be a valid root), but that we don't stem "iphoné" to "iphon". If our stemming isn't diacritic insensitive, our queries can't be. Can we change "iphoné" to "iphone" before it reaches the stemmer so it generates the same root? | ||||||||||||
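The ordering question raised in this comment can be sketched in a few lines of Python. This is a toy model only: `toy_stem` is a hypothetical stand-in that drops one trailing ASCII "e" and is not the Snowball/libstemmer English algorithm, and the "stem first, then strip" ordering is an assumption about the pipeline made for illustration, not a description of the server code.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Decompose to NFD and drop combining marks (e.g. the acute accent)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(word: str) -> str:
    """Hypothetical stemmer: remove a single trailing ASCII 'e'."""
    if len(word) > 2 and word.endswith("e"):
        return word[:-1]
    return word


def stem_then_strip(word: str) -> str:
    # Assumed ordering: the stemmer sees the accented character.
    return strip_diacritics(toy_stem(word))


def strip_then_stem(word: str) -> str:
    # Proposed ordering: diacritics are removed before stemming.
    return toy_stem(strip_diacritics(word))


plain = "iphone"
accented = "iphon\u00e9"  # "iphoné" with a precomposed e-acute

# Stemming first yields different index terms for the two spellings.
mismatch = (stem_then_strip(plain), stem_then_strip(accented))

# Stripping first yields the same term for both.
match = (strip_then_stem(plain), strip_then_stem(accented))
```

In this toy model, stemming first produces "iphon" and "iphone", while stripping first produces "iphon" for both spellings. The trade-off, raised elsewhere in this ticket, is that a real stemmer relies on diacritics in some languages, so stripping them first can change which stem is produced.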
| Comment by Kyle Suarez [ 25/Jul/17 ] | ||||||||||||
|
After some investigation, I've found that the problem is not a diacritic problem, but a stemming problem. In your text search, the default language is English. Unfortunately, our vendored third-party stemming library, libstemmer.c, stems the word "iphone" into "iphon" when in English mode. Thus, the trailing "e" is cut off completely and is not included in the search. When the language is changed to "none", stemming does not occur, and I get the expected results.
| ||||||||||||
| Comment by Ian Whalen (Inactive) [ 14/Jul/17 ] | ||||||||||||
|
Reminder: kyle.suarez please review to see if you can find the underlying cause. | ||||||||||||
| Comment by Kelsey Schubert [ 29/Jun/17 ] | ||||||||||||
|
Hi felix2626, Thank you for reporting this issue. I've marked this ticket to be scheduled against currently planned work. Please continue to watch this ticket for updates. Kind regards, |