[SERVER-8428] Text search tokens need to be unicode normalized
Created: 31/Jan/13  Updated: 06/Dec/22  Resolved: 16/Aug/19

Status: Closed
Project: Core Server
Component/s: Text Search
Affects Version/s: 2.3.2
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: J Rassi
Assignee: Backlog - Query Team (Inactive)
Resolution: Done
Votes: 2
Labels: query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams: Query
Operating System: ALL
Participants:

 Description   

For example, insert "café" twice, once with a precomposed é (U+00E9) and once as "e" followed by a combining acute accent (U+0301):

>>> import pymongo
>>> testdb = pymongo.MongoClient()['test']
>>> doc1 = {'_id':1, 'content':u'caf\xe9'}
>>> doc2 = {'_id':2, 'content':u'cafe\u0301'}
>>> testdb.foo.insert(doc1)
1
>>> testdb.foo.insert(doc2)
2

But a text search for one does not return the other:

> db.foo.ensureIndex({content:"text"})
> db.foo.find()
{ "_id" : 1, "content" : "café" }
{ "_id" : 2, "content" : "café" }
> s = db.foo.findOne({_id:1}).content
café
> db.foo.runCommand("text",{search:s})
{
	"queryDebugString" : "café||||||",
	"language" : "english",
	"results" : [
		{
			"score" : 1.1,
			"obj" : {
				"_id" : 1,
				"content" : "café"
			}
		}
	],
	"stats" : {
		"nscanned" : 1,
		"nscannedObjects" : 0,
		"n" : 1,
		"nfound" : 1,
		"timeMicros" : 104
	},
	"ok" : 1
}
> 



 Comments   
Comment by David Storch [ 16/Aug/19 ]

This issue does not exist for version 3 text indexes:

MongoDB Enterprise > db.c.drop()
true
MongoDB Enterprise > db.c.insert({_id: 1, content: "caf\xe9"})
WriteResult({ "nInserted" : 1 })
MongoDB Enterprise > db.c.insert({_id: 2, content: "cafe\u0301"})
WriteResult({ "nInserted" : 1 })
MongoDB Enterprise > db.c.createIndex({content: "text"})
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 1,
	"numIndexesAfter" : 2,
	"commitQuorum" : 1,
	"ok" : 1
}
MongoDB Enterprise > db.c.find({$text: {$search: "caf\xe9"}})
{ "_id" : 2, "content" : "café" }
{ "_id" : 1, "content" : "café" }
MongoDB Enterprise > db.c.find({$text: {$search: "cafe"}})
{ "_id" : 2, "content" : "café" }
{ "_id" : 1, "content" : "café" }

Text index v3 is diacritic-insensitive and case-insensitive, and it handles Unicode normalization correctly, as shown above. Older text index versions still have this problem:

MongoDB Enterprise > db.c.createIndex({content: "text"}, {textIndexVersion: 2})
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 1,
	"numIndexesAfter" : 2,
	"commitQuorum" : 1,
	"ok" : 1
}
MongoDB Enterprise > db.c.find({$text: {$search: "cafe"}})
MongoDB Enterprise > db.c.find({$text: {$search: "caf\xe9"}})
{ "_id" : 1, "content" : "café" }

We do not plan to fix this for older text index versions, so I'm closing this issue as "Done" as part of introducing v3 text indexes.
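
For collections that stay on an older text index version, one possible client-side mitigation is to normalize strings to a single Unicode form before both writing and searching. The sketch below is not part of this ticket: `normalize_text` and `normalize_doc` are hypothetical helper names, and the choice of NFC is an assumption. It only makes canonically equivalent spellings identical; it would not make a search for "cafe" match "café".

```python
import unicodedata


def normalize_text(value: str) -> str:
    """Return value in NFC so canonically equivalent sequences become identical."""
    return unicodedata.normalize("NFC", value)


def normalize_doc(doc: dict, field: str = "content") -> dict:
    """Return a shallow copy of doc with one string field normalized to NFC.

    Hypothetical helper: the field name "content" follows the ticket's example.
    """
    out = dict(doc)
    if isinstance(out.get(field), str):
        out[field] = normalize_text(out[field])
    return out


# The two spellings from the description end up with the same stored value.
doc1 = normalize_doc({"_id": 1, "content": "caf\u00e9"})
doc2 = normalize_doc({"_id": 2, "content": "cafe\u0301"})

# The search string is normalized the same way before it is sent.
query = {"$text": {"$search": normalize_text("cafe\u0301")}}
```

With pymongo, the normalized documents and query would then be passed to `insert_one` and `find` as usual.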

Comment by Will Shaver [ 10/Apr/14 ]

Sure: https://jira.mongodb.org/browse/SERVER-13535. I love creating bugs, it seems.

Comment by J Rassi [ 09/Apr/14 ]

"I had this issue come up with Pokémon vs Pokemon."

Unicode normalization does not strip diacritic marks from Unicode strings (though feel free to file a separate ticket for that feature request!). The example in the description illustrates that the Unicode string "café" (where é is encoded as "LATIN SMALL LETTER E WITH ACUTE") should be considered equivalent to the Unicode string "café" (where é is encoded as "LATIN SMALL LETTER E" + "COMBINING ACUTE ACCENT"). See http://unicode.org/reports/tr15/ for a description of Unicode normalization.
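
The distinction can be checked with Python's standard `unicodedata` module. This is a minimal sketch using the two strings from the description; the variable names are illustrative only.

```python
import unicodedata

precomposed = "caf\u00e9"   # é as LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# The raw strings are different code point sequences.
raw_equal = precomposed == decomposed

# Normalizing both to the same form (NFC here) makes them compare equal.
nfc_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

# Normalization keeps the accent: the NFD form still carries U+0301,
# and the NFC form is not the plain ASCII word "cafe".
keeps_accent = "\u0301" in unicodedata.normalize("NFD", precomposed)
still_accented = unicodedata.normalize("NFC", decomposed) != "cafe"

print(raw_equal, nfc_equal, keeps_accent, still_accented)
```

Matching "Pokemon" against "Pokémon" would need a separate diacritic-folding step, which is the feature request referred to above.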

Comment by Will Shaver [ 09/Apr/14 ]

Adding a couple of keywords so this issue is easier to find: diacritical, diacritics, accents, accentuated.

I had this issue come up with Pokémon vs Pokemon.
