[SERVER-8423] Text search case folding needs utf-8 support Created: 31/Jan/13  Updated: 05/Dec/16  Resolved: 11/Aug/15

Status: Closed
Project: Core Server
Component/s: Text Search
Affects Version/s: 2.3.2
Fix Version/s: 3.1.7

Type: Improvement Priority: Major - P3
Reporter: J Rassi Assignee: Adam Chelminski (Inactive)
Resolution: Done Votes: 18
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-19557 Create Text Index v3 Closed
Documented
is documented by DOCS-9550 Docs for SERVER-8423: Text search cas... Closed
Duplicate
is duplicated by SERVER-17165 Full Text returns wrong results for T... Closed
is duplicated by SERVER-9367 toLowerCase() function does not work ... Closed
Backwards Compatibility: Major Change
Sprint: Platform 6 07/17/15, Platform 8 08/28/15, Platform 7 08/10/15

 Description   

For example, in Russian queries "Как" currently lowercases to itself, whereas it should lowercase to "как".

Needed for stopword removal, matching, etc.
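
The difference between ASCII-only and Unicode-aware lowercasing can be sketched in plain JavaScript. The helper `asciiToLower` below is hypothetical and only imitates a fold limited to A-Z; it is not the server's actual implementation, which this issue does not show.

```javascript
// Hypothetical illustration: fold only the ASCII letters A-Z and pass every
// other character through unchanged. This imitates an ASCII-only case fold;
// it is an assumption for demonstration, not the server's code.
function asciiToLower(s) {
  let out = "";
  for (const ch of s) {
    const cp = ch.codePointAt(0);
    out += cp >= 0x41 && cp <= 0x5a ? String.fromCodePoint(cp + 0x20) : ch;
  }
  return out;
}

const query = "Как дела";
console.log(asciiToLower(query));         // Как дела  (Cyrillic capital "К" untouched)
console.log(query.toLowerCase());         // как дела  (Unicode-aware lowercasing)
console.log(asciiToLower("How Are You")); // how are you
```

With a fold like `asciiToLower`, the query term "как" and the stored term "Как" never compare equal, which matches the empty result in the session below.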

> db.foo.insert({content:"Как дела?"})
> db.foo.ensureIndex({content:"text"},{default_language:"russian"})
> db.foo.runCommand("text",{search:"\"как дела\""})
{
	"queryDebugString" : "дел||||как дела||",
	"language" : "russian",
	"results" : [ ],
	"stats" : {
		"nscanned" : 0,
		"nscannedObjects" : 0,
		"n" : 0,
		"nfound" : 0,
		"timeMicros" : 104
	},
	"ok" : 1
}
> db.foo.runCommand("text",{search:"\"Как дела\""})
{
	"queryDebugString" : "Как|дел||||Как дела||",
	"language" : "russian",
	"results" : [
		{
			"score" : 1,
			"obj" : {
				"_id" : ObjectId("510aa82ddb47733460b47eff"),
				"content" : "Как дела?"
			}
		}
	],
	"stats" : {
		"nscanned" : 1,
		"nscannedObjects" : 0,
		"n" : 1,
		"nfound" : 1,
		"timeMicros" : 118
	},
	"ok" : 1
}

 Comments   
Comment by Adam Chelminski (Inactive) [ 11/Aug/15 ]

This fix is integrated into text index v3. Indexes of that version will not work at all with previous versions of MongoDB.

Comment by Agrumas [X] [ 18/Jul/15 ]

I am very excited about upcoming fulltext-search improvements, thanks @adam.chelminski

Comment by Pavel Chertorogov [ 29/Jun/15 ]

UP!!!

This problem with UTF-8 collation has been open for 2 years.
In 2010 that was normal, but in 2015 it is not.

Comment by Andrey Yurchenkov [ 07/May/15 ]

Russian language, version 3.0.1.
We need this feature!!!!!

Comment by Matt Kangas [ 03/Feb/15 ]

SERVER-9367 and SERVER-17165 are the same underlying problem, but for Turkish.

Comment by Alexander Black [X] [ 08/Dec/14 ]

Same problem in version 2.6.5.

Comment by Nikita Dedik [ 26/Nov/14 ]

Same problem here, Russian language.

Manual says: "If the index language is English, text indexes are case-insensitive for non-diacritics; i.e. case insensitive for [A-z]." - why is it so, in the end?

Comment by Roman [ 06/Feb/14 ]

This bug is very critical for Russian users of MongoDB!

Comment by Valentin Kostadinov [ 13/Jan/14 ]

Looking forward to having this bug fixed. Currently, I'm creating a "normalized" field (simply doing the lower case folding myself) and then creating the text index on that field. It's a bit of a pain point. Should be really easy to fix at least for the major languages. Cyrillic would be a big step forward and should be very easy to fix.
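
The workaround described above might look roughly like the following sketch. The helper `withNormalizedField` and the `content_lower` field name are hypothetical, chosen only for illustration; the commenter's actual code is not shown in this issue.

```javascript
// Hypothetical helper: returns a copy of `doc` with an extra "<field>_lower"
// property holding the lowercased text. It relies on JavaScript's
// locale-independent String.prototype.toLowerCase(), which folds Cyrillic
// but does not apply Turkish dotted/dotless "i" rules.
function withNormalizedField(doc, field) {
  const copy = Object.assign({}, doc);
  copy[field + "_lower"] = String(doc[field]).toLowerCase();
  return copy;
}

const normalized = withNormalizedField({ content: "Как дела?" }, "content");
console.log(normalized.content_lower); // как дела?

// In the shell, usage could then look like this (untested sketch, assuming
// the same collection and commands as in the description):
//   db.foo.insert(normalized)
//   db.foo.ensureIndex({content_lower: "text"}, {default_language: "russian"})
//   db.foo.runCommand("text", {search: "\"Как дела\"".toLowerCase()})
```

The query string has to be lowercased by the client as well, since the server would otherwise still compare "Как" against the stored "как".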

Comment by Cyrill [ 10/Dec/13 ]

Any news about this bug?

Comment by J Rassi [ 31/Jan/13 ]

A case folding table is available at <http://www.unicode.org/Public/UNIDATA/CaseFolding.txt> and is regularly updated. It could be processed similarly to the stopword lists (with the result then encoded as UTF-8).
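
As a rough illustration of processing that file, the sketch below parses lines in the CaseFolding.txt layout (`code; status; mapping; # name`) into a lookup table. The function names `parseCaseFolding` and `foldCase` are hypothetical, and `SAMPLE` is a short hand-copied excerpt rather than the full file; how the server actually generates its tables is not described in this issue.

```javascript
// A few entries in CaseFolding.txt format. Status C = common, S = simple,
// F = full (may map to several code points), T = Turkic-specific.
const SAMPLE = [
  "# excerpt in CaseFolding.txt format",
  "0041; C; 0061; # LATIN CAPITAL LETTER A",
  "0049; C; 0069; # LATIN CAPITAL LETTER I",
  "0049; T; 0131; # LATIN CAPITAL LETTER I",
  "00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S",
  "0414; C; 0434; # CYRILLIC CAPITAL LETTER DE",
  "041A; C; 043A; # CYRILLIC CAPITAL LETTER KA",
].join("\n");

// Hypothetical parser: builds a Map from source character to folded string,
// keeping only entries whose status is listed in `statuses`.
function parseCaseFolding(text, statuses = ["C", "S"]) {
  const table = new Map();
  for (const rawLine of text.split("\n")) {
    const line = rawLine.split("#")[0].trim(); // strip comments
    if (line === "") continue;
    const [code, status, mapping] = line.split(";").map((f) => f.trim());
    if (!statuses.includes(status)) continue;
    const from = String.fromCodePoint(parseInt(code, 16));
    const to = mapping
      .split(/\s+/)
      .map((hex) => String.fromCodePoint(parseInt(hex, 16)))
      .join("");
    table.set(from, to); // later lines override earlier ones
  }
  return table;
}

// Fold a string one code point at a time using the parsed table.
function foldCase(s, table) {
  let out = "";
  for (const ch of s) out += table.has(ch) ? table.get(ch) : ch;
  return out;
}

const simple = parseCaseFolding(SAMPLE);
console.log(foldCase("Как Дела", simple)); // как дела

// Including status T yields the Turkic mapping of "I" to dotless "ı".
const turkic = parseCaseFolding(SAMPLE, ["C", "S", "T"]);
console.log(foldCase("I", turkic)); // ı
```

Selecting statuses per language is one possible way to cover the Turkish cases mentioned in the duplicate issues; it is shown here as a design option, not as what was implemented.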
