[SERVER-26658] Full Text returns wrong results for Turkish Created: 17/Oct/16  Updated: 27/Dec/23

Status: Backlog
Project: Core Server
Component/s: Text Search
Affects Version/s: 3.2.10, 3.3.15, 3.4.0-rc0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kemal Ogun Isik Assignee: Backlog - Query Integration
Resolution: Unresolved Votes: 4
Labels: qi-text-search, query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Query Integration
Operating System: ALL
Participants:

 Description   

Text index version 3 does not return correct search results when a word contains a Turkish "i" character (dotted i/İ or dotless ı/I).

Create Script

db.turk.drop()
db.turk.insert({ _id: "small_dotless", t1 : "quıt" })
db.turk.insert({ _id: "small_dot", t1 : "quit" })
db.turk.insert({ _id: "big_dotless", t1 : "QUIT" })
db.turk.insert({ _id: "big_dot", t1 : "QUİT" })
 
db.turk.createIndex({t1: "text"}, {
    default_language: "turkish",
    name: "TextIndex"
});

Actual Results

> db.turk.find({$text: {$search: "quit", $language: "tr", $caseSensitive: false, $diacriticSensitive: false}});
{"_id" : "big_dot", "t1" : "QUİT"}
{"_id" : "small_dot", "t1" : "quit"}

Expected Results

> db.turk.find({$text: {$search: "quit", $language: "tr", $caseSensitive: false, $diacriticSensitive: false}});
{"_id" : "small_dotless", "t1" : "quıt"}
{"_id" : "small_dot", "t1" : "quit"}
{"_id" : "big_dotless", "t1" : "QUIT"}
{"_id" : "big_dot", "t1" : "QUİT"}



 Comments   
Comment by Kyle Suarez [ 15/May/17 ]

Hello ogunisik,

Apologies for the delay, and for the frustration with the Turkish diacritic bug. The bug is rather tricky and has no obvious solution, and as I mentioned in my previous comment, it may require a significant rewrite of our codepoint-to-codepoint transformation algorithm. In addition, if that turns out to be required, we may have to bump the text index version. Both of these points have led us to put this ticket on the Backlog, meaning that it is currently not scheduled to be worked on.

Sorry again for the inconvenience.

Regards,
Kyle

Comment by Kemal Ogun Isik [ 07/May/17 ]

Hello,

Is there any plan to fix this issue? We are still waiting.

Comment by Kyle Suarez [ 07/Dec/16 ]

I don't know if I've pinpointed the exact underlying cause, but I have some suspicions. unicode::codepointToLower() has special behavior for CaseFoldMode::kTurkish, but unicode::codepointRemoveDiacritics() isn't Turkish-aware. I would assume that, in a Turkish diacritic-insensitive setting, i maps to ı; that is, 0x69 maps to 0x131. However, 0x69 isn't handled as a case in the giant switch statement.

An offline chat with redbeard0531 implied that our codepoint-to-codepoint transformation algorithm can't handle the Turkish i properly, but I'm not sure exactly where to look further.
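
To make the suspicion above concrete, here is a minimal JavaScript sketch of the two per-codepoint steps. The helper names (turkishToLower, turkishRemoveDiacritics, fold) are hypothetical and are not the server's unicode::codepointToLower() or unicode::codepointRemoveDiacritics(); the i to ı mapping is the assumption stated above, not confirmed behavior. Under the further assumption that the existing diacritic removal leaves both i and ı unchanged, Turkish-aware lowercasing alone splits the four test words into two groups, which is consistent with the Actual Results in the description; adding the assumed mapping folds all four to the same term.

```javascript
// Illustrative sketch only: hypothetical helpers, not MongoDB's implementation.

// Turkish-aware lowercasing of a single code point: I -> ı and İ -> i.
function turkishToLower(cp) {
  if (cp === 0x49) return 0x131;   // I (U+0049) -> ı (U+0131)
  if (cp === 0x130) return 0x69;   // İ (U+0130) -> i (U+0069)
  return String.fromCodePoint(cp).toLowerCase().codePointAt(0);
}

// Assumed Turkish-aware diacritic removal: i (0x69) -> ı (0x131).
function turkishRemoveDiacritics(cp) {
  return cp === 0x69 ? 0x131 : cp;
}

// Fold a word one code point at a time, optionally applying the
// assumed diacritic-removal step after lowercasing.
function fold(word, removeDiacritics) {
  return Array.from(word, ch => {
    const lower = turkishToLower(ch.codePointAt(0));
    const out = removeDiacritics ? turkishRemoveDiacritics(lower) : lower;
    return String.fromCodePoint(out);
  }).join("");
}

// The four values of t1 from the create script.
const words = ["qu\u0131t", "quit", "QUIT", "QU\u0130T"];

// Lowercasing only: two distinct terms ("quıt" and "quit").
const lowerOnly = words.map(w => fold(w, false));

// Lowercasing plus the assumed i -> ı mapping: a single term.
const fullyFolded = words.map(w => fold(w, true));

console.log(lowerOnly);
console.log(fullyFolded);
```

This only illustrates why a Turkish-aware diacritic step would merge the groups; it says nothing about whether such a change fits the existing codepoint-to-codepoint design or would require a new text index version.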

Comment by Ramon Fernandez Marina [ 17/Oct/16 ]

Thanks for your report ogunisik, the Query team is going to look into this issue.

Generated at Thu Feb 08 04:12:47 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.