[SERVER-29918] stemming behavior for diacritics causes incorrect results Created: 29/Jun/17  Updated: 27/Oct/23  Resolved: 09/Nov/17

Status: Closed
Project: Core Server
Component/s: Text Search
Affects Version/s: 3.4.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: adrien petel Assignee: Kyle Suarez
Resolution: Works as Designed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

ubuntu 16.04, mongodb 3.4.4


Issue Links:
Duplicate
is duplicated by SERVER-30253 diacriticSensitive question Closed
Related
is related to SERVER-30726 inconsistent treatment of stopwords w... Closed
Operating System: ALL
Steps To Reproduce:

> db.test.insertMany([  
   { "_id":1, "name":"iphone" },
   { "_id":2, "name":"iphône" },
   { "_id":3, "name":"iphonë" },
   { "_id":4, "name":"iphônë" }
])
 
 
> db.test.ensureIndex({name: "text"})
 
> db.test.find({$text: {$search: "iphone"}})
{ "_id" : 1, "name" : "iphone" }
{ "_id" : 2, "name" : "iphône" }
 
> db.test.find({name: "iphone"}).collation({locale: "en", strength: 1})
{ "_id" : 1, "name" : "iphone" }
{ "_id" : 2, "name" : "iphône" }
{ "_id" : 3, "name" : "iphonë" }
{ "_id" : 4, "name" : "iphônë" }

Sprint: Query 2017-07-31, Query 2017-10-02, Query 2017-10-23, Query 2017-11-13
Participants:

 Description   

$text search is not diacritic-insensitive if the word contains a dieresis ( ¨ ). The dieresis is categorized as a diacritic in the Unicode 8.0 Character Database PropList; cf. http://www.unicode.org/Public/8.0.0/ucd/PropList.txt

Search with collation works fine with strength = 1.



 Comments   
Comment by Kyle Suarez [ 09/Nov/17 ]

I've taken another look at the issue here and thoroughly examined what happens with regard to stemming and diacritic stripping. As Dan mentions, the stemmer must be diacritic-sensitive because diacritics affect stemming, even in languages like English. For example, in English:

  • "resume" is stemmed to "resum", as you'd expect. Its conjugated forms "resumed", "resuming", etc. all have the same stem.
  • "résumé" is stemmed simply to itself, as it is a noun and has no simpler form.
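
To make the ordering question concrete, here is a toy sketch in JavaScript. It is not the server's code: `toyStem` is a hypothetical stand-in for the English Snowball stemmer that only drops a trailing unaccented "e", and `stripDiacritics` is an assumed folding step built on Unicode NFD normalization. It compares stemming before stripping with stripping before stemming for the two words above.

```javascript
// Toy model only: a hypothetical stand-in for the English stemmer.
// It handles nothing beyond a trailing plain "e".
const toyStem = (word) => word.replace(/e$/, "");

// Assumed folding step: decompose to NFD, then drop combining marks.
const stripDiacritics = (word) =>
  word.normalize("NFD").replace(/\p{M}/gu, "");

// The two possible orderings of the pipeline.
const stemThenStrip = (word) => stripDiacritics(toyStem(word));
const stripThenStem = (word) => toyStem(stripDiacritics(word));

for (const word of ["resume", "résumé"]) {
  console.log(
    word,
    "| stem then strip:", stemThenStrip(word),
    "| strip then stem:", stripThenStem(word)
  );
}
```

In this toy model, stemming first leaves "résumé" as "resume", distinct from the verb's stem "resum", while stripping first collapses both words to "resum".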

The current text search engine is written in a way that errs on the side of "correctness".
That being said, I am definitely sympathetic to the argument that "résumé" is commonly spelled as "resume" in everyday speech. However, changing the way text search works with regard to stemming and diacritic stripping will require a much larger project and detailed design. Based on this assessment, I'm going to close this ticket as Works as Designed.

For now, truly diacritic-insensitive queries should use either collation or a text index language of "none" (the latter at the cost of losing stemming).
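
The two workarounds might look roughly like this in the shell. This is a sketch, not a tested recipe: the wrapper function `diacriticInsensitiveLookups` is hypothetical, the collection `test` and field `name` are taken from the reproduction steps, and it assumes the collection has no other text index (a collection can hold only one, so an existing English-language text index would need to be dropped first).

```javascript
// Hypothetical helper; `db` is a mongo shell database handle.
function diacriticInsensitiveLookups(db, term) {
  // Workaround 1: a text index with language "none", so no stemming
  // is applied to indexed terms or to the search string.
  db.test.createIndex({ name: "text" }, { default_language: "none" });
  const viaText = db.test.find({ $text: { $search: term } });

  // Workaround 2: an equality match under a strength-1 collation,
  // which compares base characters only.
  const viaCollation = db.test
    .find({ name: term })
    .collation({ locale: "en", strength: 1 });

  return { viaText, viaCollation };
}
```

Called as `diacriticInsensitiveLookups(db, "iphone")`, this returns two cursors. Note that the collation query is a whole-field equality match, not a word search, so the two are not interchangeable for multi-word fields.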

Comment by Kyle Suarez [ 10/Aug/17 ]

It still seems like something is off here, though, like there is an inconsistent approach to the way we perform diacritic stripping and stemming. In any case, I've stopped investigating this ticket as the Query Team's priority is on 3.6 scheduled features. It does seem worth it, though, for someone to investigate this behavior further once we revisit the tickets on the backlog.

Comment by Daniel Pasette (Inactive) [ 03/Aug/17 ]

The stemmer is diacritic-sensitive, and it must be, because accents have meaning in some languages.
See this comment: https://github.com/mongodb/mongo/blob/r3.5.10/src/mongo/db/fts/fts_unicode_tokenizer.cpp#L96

Comment by Kyle Suarez [ 25/Jul/17 ]

Good point... I tried

std::cout << "Stemmed version of iphone: " << s.stem("iphone") << std::endl;
std::cout << "Stemmed version of iphoné: " << s.stem("iphoné") << std::endl;
std::cout << "Stemmed version of iphonë: " << s.stem("iphonë") << std::endl;

and got

Stemmed version of iphone: iphon
Stemmed version of iphoné: iphoné
Stemmed version of iphonë: iphonë

I'm putting this ticket into Needs Triage, so that the query team can triage this ticket at the next planning meeting. Whoever picks up this ticket should look at the places where stemming happens and see if we can strip diacritics before it occurs.

Comment by Kelsey Schubert [ 25/Jul/17 ]

I'd argue the problem isn't that we stem "iphone" to "iphon" (the stem doesn't have to be a valid root), but that we don't stem "iphoné" to "iphon". If our stemming isn't diacritic insensitive, our queries can't be. Can we change "iphoné" to "iphone" before it reaches the stemmer so it generates the same root?
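
A minimal sketch of this suggestion, under loud assumptions: `toyStem` is a hypothetical stand-in for the English stemmer (it only drops a trailing unaccented "e"), `foldDiacritics` is an assumed folding step based on NFD normalization, and the stem-then-fold order is a guess at the current pipeline rather than something read from the server code.

```javascript
// Assumed folding step: decompose to NFD, then drop combining marks.
const foldDiacritics = (word) =>
  word.normalize("NFD").replace(/\p{M}/gu, "");

// Toy model only: hypothetical stand-in for the English stemmer.
const toyStem = (word) => word.replace(/e$/, "");

// The four names from the reproduction steps.
const names = ["iphone", "iphône", "iphonë", "iphônë"];

// Guessed current order: stem first, fold afterwards.
const stemThenFold = names.map((w) => foldDiacritics(toyStem(w)));

// Suggested order: fold first, so the stemmer sees a plain "e".
const foldThenStem = names.map((w) => toyStem(foldDiacritics(w)));

console.log("stem then fold:", stemThenFold);
console.log("fold then stem:", foldThenStem);
```

In this toy model, stem-then-fold yields "iphon" for the first two names and "iphone" for the last two, which lines up with the two documents returned in the reproduction steps; fold-then-stem yields "iphon" for all four.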

Comment by Kyle Suarez [ 25/Jul/17 ]

After some investigation, I've found that the problem is not a diacritic problem, but a stemming problem. In your text search, the default language is English. Unfortunately, our vendored third-party stemming library, libstemmer_c, stems the word "iphone" into "iphon" when in English mode. Thus, the "e" is cut off completely and is not included in the search.

When I change the language to "none", stemming does not occur, and I get the expected results:

> db.text.find()
{ "_id" : ObjectId("59778fac798c05e256b74092"), "t" : "iphone" }
{ "_id" : ObjectId("59778faf798c05e256b74093"), "t" : "iphoné" }
{ "_id" : ObjectId("59778fb2798c05e256b74094"), "t" : "iphonë" }
 
> db.text.find({$text: {$search: "iphone"}})
{ "_id" : ObjectId("59778fac798c05e256b74092"), "t" : "iphone" }
 
> db.text.find({$text: {$search: "iphone", $language: "none"}})
{ "_id" : ObjectId("59778ef3798c05e256b74086"), "t" : "iphonë" }
{ "_id" : ObjectId("59778faf798c05e256b74093"), "t" : "iphoné" }
{ "_id" : ObjectId("59778fb2798c05e256b74094"), "t" : "iphonë" }

Comment by Ian Whalen (Inactive) [ 14/Jul/17 ]

Reminder: kyle.suarez please review to see if you can find the underlying cause.

Comment by Kelsey Schubert [ 29/Jun/17 ]

Hi felix2626,

Thank you for reporting this issue. I've marked this ticket to be scheduled against currently planned work. Please continue to watch this ticket for updates.

Kind regards,
Thomas
