[SERVER-15027] French stemming issue with word ending with "ée" Created: 25/Aug/14  Updated: 27/Aug/18  Resolved: 27/Aug/18

Status: Closed
Project: Core Server
Component/s: Text Search
Affects Version/s: 2.6.4, 2.7.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Charles Billette Assignee: Stennie Steneker (Inactive)
Resolution: Done Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-10062 Add user configurable stop word lists... Backlog
Operating System: ALL
Participants:

 Description   

French words ending in "ée", such as "glacée", are not stemmed correctly.

Currently the index will not return documents containing the word "glacée" if the search term is "glacee". But if you use the search term "glace", it does return documents containing the word "glacée".

It looks like the stemming process treats "ée" as "e", which is wrong; it should treat "ée" as "ee".

We have the problem with every word ending in "ée", for example:

  • glacée
  • purée
  • boutonnée

Thanks



 Comments   
Comment by Stennie Steneker (Inactive) [ 27/Aug/18 ]

This is actually a limitation of algorithmic stemming. Stemming algorithms use generic heuristics to reduce words to an expected root form, but don't actually have the context of language or grammar. Accuracy will vary depending on the language, verb conjugation, and the stemming algorithm used.

MongoDB (as at 4.0) uses the Snowball stemming library. You can test expected outcomes using the Snowball online demo.

There are other approaches for more accurate inflection which are generally referred to as lemmatization. Lemmatization algorithms are more complex and start heading into the domain of natural language processing. There are many open source (and commercial) toolkits that you may be able to leverage if you want to implement more advanced text search in your application, but these are outside the current scope of the MongoDB text search feature.

For more background, see: Stemming and lemmatization:

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Regards,
Stennie

Comment by David Storch [ 16/Aug/18 ]

I agree with rubenrua's assessment that this appears to be an issue with the stemmer. MongoDB vendorizes the Snowball stemmer and provides a thin mongo::fts::Stemmer wrapper around it. In order to prove that this is an upstream issue with the third party library, I wrote a small unit test which shows that "glacée" is stemmed to "glac" in French mode:

+TEST(StemmerTest, FrenchStemmingTestTextIndexV3) {
+    auto language =
+        unittest::assertGet(FTSLanguage::make("fr"_sd, TextIndexVersion::TEXT_INDEX_VERSION_3));
+    Stemmer stemmer(language);
+    ASSERT_EQUALS(stemmer.stem("glacée"), "glac");
+}

Note that changing stemming behavior may require bumping the text index version, as there are text indexes in the wild which already have keys produced by this behavior.

Comment by ruben gonzalez [ 08/Aug/18 ]

Can we update the issue summary to make it more relevant? Something like "Text search with accent marks is not useful".

Comment by ruben gonzalez [ 08/Aug/18 ]

The same applies to Spanish. Searching for a Spanish word without accent marks doesn't work if the accent mark is on the last syllable.

Searching without accent marks is very common behavior in Spanish, and this issue makes text search not useful.

 

> db.quotes.createIndex(
...    { content : "text" },
...    { default_language: "spanish" }
... )
{ "createdCollectionAutomatically" : true, "numIndexesBefore" : 1, "numIndexesAfter" : 2, "ok" : 1}
> db.quotes.insert({content: "Filología Clásica"})
WriteResult({ "nInserted" : 1 })
> db.quotes.find()
{ "_id" : ObjectId("5b6aa4d160067755ee9ef212"), "content" : "Filología Clásica" }
> db.quotes.find({ "$text": { "$search": 'filología', "$language": "es" } })
{ "_id" : ObjectId("5b6aa4d160067755ee9ef212"), "content" : "Filología Clásica" }
> db.quotes.find({ "$text": { "$search": 'filologia', "$language": "es" } })
// No result when finding the document using the text index without accent mark.

 

It's an issue with the stemming.

> db.quotes.find({ "$text": { "$search": 'filologia', "$language": "es" } }).explain("executionStats")
...
 "parsedTextQuery" : { "terms" : [ "filologi" ],
...
> db.quotes.find({ "$text": { "$search": 'filología', "$language": "es" } }).explain("executionStats")
...
 "parsedTextQuery" : { "terms" : [ "filolog" ],
...

The result of stemming filologia is filologi, and the result of stemming filología (with accent mark) is filolog. The ending -ía is very common in Spanish.

Workaround: Specify "none" as the language value to avoid stemming.

But this is not a perfect solution: stemming and stop words both depend on the language value, and there is no stop-word feature with "none".

IMHO custom stop words would be a great feature for this case. (SERVER-10062)
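
One application-side way to live with the "none" workaround, sketched here as an assumption rather than anything MongoDB provides: keep an accent-folded copy of the text in a separate field, index that field with the language set to "none", and fold the search string the same way before querying. The helper name foldDiacritics and the field name content_folded are hypothetical; String.prototype.normalize is standard JavaScript.

```javascript
// Hypothetical helper: strip combining accent marks and lowercase the text,
// so that "filología" and "filologia" end up as the same string.
function foldDiacritics(text) {
  return text
    .normalize("NFD")                 // decompose "í" into "i" + combining acute accent
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining marks
    .toLowerCase();
}

// Hypothetical document shape: keep the original text, add a folded copy to index.
const doc = { content: "Filología Clásica" };
doc.content_folded = foldDiacritics(doc.content);

// Both spellings of the query fold to the same search string.
const withAccent = foldDiacritics("filología");
const withoutAccent = foldDiacritics("filologia");
```

The folded field would then carry the text index, and the application would pass foldDiacritics(userInput) as the search string. Note that this also folds "ñ" to "n", which may not be acceptable for Spanish, and that it does nothing about the missing stemming and stop words.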

 

Comment by croze [ 09/Feb/16 ]

Add Affects Version/s: 3.2

Comment by croze [ 09/Feb/16 ]

With this bug, French full-text search is unusable.
There is also another problem with the ' character.

Sample:

db.articles.createIndex( { subject: "text" } )
db.articles.insert(
   [
     { _id: 1, subject: "été" },
     { _id: 2, subject: "l'été" },
   ]
)
db.articles.find({ "$text": { "$search": "été", "$language": "fr" } }).count()
// 0 => should return 2 for french
 
db.articles.find({ "$text": { "$search": "été"} }).count()
// 1 => ok for english
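
For the apostrophe case, a minimal client-side sketch, assuming the apostrophe is what keeps "l'été" from matching "été" (the sample above suggests this but does not prove it). The idea is to strip common French elided prefixes from both the stored text and the search string before they reach the index. The function name stripElision and the prefix list are hypothetical, and this is not something the MongoDB text search does itself.

```javascript
// Hypothetical pre-processing step: remove elided French prefixes such as
// l', d', j', qu' when they start a word. Handles both ' and the typographic ’.
// The lookbehind keeps words like "aujourd'hui" intact, because the "d'" there
// is preceded by a letter.
const ELISION = /(?<!\p{L})(?:qu|[ldjmtscn])['\u2019]/giu;

function stripElision(text) {
  return text.replace(ELISION, "");
}

const stored = stripElision("l'été");        // text as it would be indexed
const query = stripElision("été");           // search string, unchanged here
const untouched = stripElision("aujourd'hui");
```

The prefix list is deliberately short and would need review against real data before use.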
