[SERVER-15027] French stemming issue with word ending with "ée" Created: 25/Aug/14 Updated: 27/Aug/18 Resolved: 27/Aug/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Text Search |
| Affects Version/s: | 2.6.4, 2.7.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Charles Billette | Assignee: | Stennie Steneker (Inactive) |
| Resolution: | Done | Votes: | 3 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Operating System: | ALL | ||||||||
| Participants: | |||||||||
| Description |
|
French words ending in "ée", such as "glacée", are not stemmed correctly. Currently the index will not return documents containing the word "glacée" if the search term is "glacee". But if you use the search term "glace", it will return documents containing the word "glacée". It looks like the stemming process maps "ée" to "e", which is wrong; it should map "ée" to "ee". We have this problem for all words ending in "ée".
Thanks |
| Comments |
| Comment by Stennie Steneker (Inactive) [ 27/Aug/18 ] | |||||||||||||||||||||
|
This is actually a limitation of algorithmic stemming. Stemming algorithms use generic heuristics to reduce words to an expected root form, but don't actually have the context of language or grammar. Accuracy will vary depending on the language, verb conjugation, and the stemming algorithm used. MongoDB (as at 4.0) uses the Snowball stemming library. You can test expected outcomes using the Snowball online demo. There are other approaches for more accurate inflection which are generally referred to as lemmatization. Lemmatization algorithms are more complex and start heading into the domain of natural language processing. There are many open source (and commercial) toolkits that you may be able to leverage if you want to implement more advanced text search in your application, but these are outside the current scope of the MongoDB text search feature. For more background, see: Stemming and lemmatization:
Regards, | |||||||||||||||||||||
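The context-free heuristics described in the comment above can be illustrated with a toy suffix-stripping stemmer. This is only a sketch: it is not the Snowball French algorithm, and the `TOY_SUFFIXES` rule list and the `toy_stem` function are hypothetical, chosen so that the outcome matches what is reported in this ticket ("glacée" and "glace" share a stem, while "glacee" does not).

```python
# Toy suffix-stripping stemmer. NOT the Snowball algorithm: the rule list
# below is a hypothetical illustration of a context-free heuristic.
# "\u00e9" is the precomposed letter e-acute.
TOY_SUFFIXES = ("\u00e9e", "e")  # checked in order, longest first


def toy_stem(word: str) -> str:
    """Strip the first matching suffix; return the word unchanged otherwise."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    # Accented form, its noun relative, and the unaccented spelling.
    for term in ("glac\u00e9e", "glace", "glacee"):
        print(term, "->", toy_stem(term))
```

Because the rules look only at the trailing characters, the unaccented spelling "glacee" falls through to a different rule than "glacée" and ends up with a different stem, which is the mismatch reported in the description.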
| Comment by David Storch [ 16/Aug/18 ] | |||||||||||||||||||||
|
I agree with rubenrua's assessment that this appears to be an issue with the stemmer. MongoDB vendorizes the Snowball stemmer and provides a thin mongo::fts::Stemmer wrapper around it. In order to prove that this is an upstream issue with the third party library, I wrote a small unit test which shows that "glacée" is stemmed to "glac" in French mode:
Note that changing stemming behavior may require bumping the text index version, as there are text indexes in the wild which already have keys produced by this behavior. | |||||||||||||||||||||
| Comment by ruben gonzalez [ 08/Aug/18 ] | |||||||||||||||||||||
|
Can we update the issue summary to make it more relevant? Something like "Text search with accent marks is not useful". | |||||||||||||||||||||
| Comment by ruben gonzalez [ 08/Aug/18 ] | |||||||||||||||||||||
|
The same applies to Spanish. Searching for a Spanish word without accent marks doesn't work if the accent mark is on the last syllable. Searching without accent marks is very common in Spanish, and this issue makes text search not useful.
It's an issue with the stemming.
The result of stemming filologia is filologi, and the result of stemming filología (with accent mark) is filolog. The ending -ía is very common in Spanish. Workaround: specify "none" as the language value to avoid stemming. This is not a perfect solution, however: both stemming and stop words depend on the language value, so there is no stop-word filtering with "none". IMHO custom stop words would be a great feature for this case (SERVER-10062).
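One possible application-side mitigation, sketched here as an assumption rather than a MongoDB feature, is to strip accent marks from both the stored text and the search term before either reaches the stemmer, so that both sides are stemmed from the same spelling. The helper name `strip_diacritics` is hypothetical; it relies only on Python's standard `unicodedata` module.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks (accents) removed."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep everything else.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    for word in ("filolog\u00eda", "glac\u00e9e"):
        print(word, "->", strip_diacritics(word))
```

This sketch would require storing the normalized text in a separate indexed field and applying the same function to every query. It also removes marks that distinguish letters, such as the tilde in "ñ", so it may not suit every language.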
| |||||||||||||||||||||
| Comment by croze [ 09/Feb/16 ] | |||||||||||||||||||||
|
Please add 3.2 to Affects Version/s. | |||||||||||||||||||||
| Comment by croze [ 09/Feb/16 ] | |||||||||||||||||||||
|
With this bug, the French Full Text Search is unusable. Sample:
|