[SERVER-8988] Text indexes should partition index entries by language Created: 14/Mar/13  Updated: 22/Mar/23

Status: Open
Project: Core Server
Component/s: Text Search
Affects Version/s: 2.4.0-rc3
Fix Version/s: features we're not sure of

Type: Improvement Priority: Major - P3
Reporter: Mike Dransfield Assignee: Backlog - Query Integration
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-13998 Support for language constrained search Backlog
Assigned Teams:
Query Integration
Participants:

 Description   

A text search for an English word with the language set to Russian returns English results, whereas a search for the same keyword with the language set to Spanish returns no results:

> db.foo.insert({text:"hello world", language:"english"})
> db.foo.ensureIndex({text:"text"})
> db.foo.runCommand("text",{search:"hello",language:"english"})
{
	"queryDebugString" : "hello||||||",
	"language" : "english",
	"results" : [
		{
			"score" : 0.75,
			"obj" : {
				"_id" : ObjectId("51435cd6141e7117a6ca8092"),
				"text" : "hello world",
				"language" : "english"
			}
		}
	],
	"stats" : {
		"nscanned" : 1,
		"nscannedObjects" : 0,
		"n" : 1,
		"nfound" : 1,
		"timeMicros" : 344
	},
	"ok" : 1
}
> db.foo.runCommand("text",{search:"hello",language:"spanish"})
{
	"queryDebugString" : "hell||||||",
	"language" : "spanish",
	"results" : [ ],
	"stats" : {
		"nscanned" : 0,
		"nscannedObjects" : 0,
		"n" : 0,
		"nfound" : 0,
		"timeMicros" : 2383
	},
	"ok" : 1
}
> db.foo.runCommand("text",{search:"hello",language:"russian"})
{
	"queryDebugString" : "hello||||||",
	"language" : "russian",
	"results" : [
		{
			"score" : 0.75,
			"obj" : {
				"_id" : ObjectId("51435cd6141e7117a6ca8092"),
				"text" : "hello world",
				"language" : "english"
			}
		}
	],
	"stats" : {
		"nscanned" : 1,
		"nscannedObjects" : 0,
		"n" : 1,
		"nfound" : 1,
		"timeMicros" : 243
	},
	"ok" : 1
}
>



 Comments   
Comment by A Mare [ 20/May/14 ]

The issue reported here is perfectly explainable. Remember that the stemming is mechanical (algorithmic) and has little to do with the actual meaning of the word in any language.
The stem of "hello" is as follows:

  • "hello" in English (and the index will retain hello as an entry)
  • "hell" in Spanish
  • "hello" in Russian

It's easy to see that the Spanish search won't find "hell" in the index, but Mongo will find "hello" when using the Russian stemming rules (which, by the way, have nothing to do with the Latin alphabet).
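
The explanation above can be illustrated with a small sketch in plain JavaScript. It is not the server's implementation: the stems of "hello" are hard-coded from the list in this comment rather than computed by a real stemmer, words without a listed stem are assumed to stem to themselves, and the names REPORTED_STEMS, stemWord, buildStemOnlyIndex and searchStemOnly are hypothetical. The index is keyed by stem only, so a query matches whenever its stem happens to equal a stored stem, regardless of language.

```javascript
// Stems of "hello" as listed in this comment (hard-coded, not computed).
const REPORTED_STEMS = {
  english: { hello: "hello" },
  spanish: { hello: "hell" },
  russian: { hello: "hello" },
};

// Assumption: any word not listed above stems to itself.
function stemWord(word, language) {
  const table = REPORTED_STEMS[language] || {};
  return Object.prototype.hasOwnProperty.call(table, word) ? table[word] : word;
}

// Build an index keyed by stem only; the document's language is used
// for stemming but is not stored alongside the index entry.
function buildStemOnlyIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const word of doc.text.split(/\s+/)) {
      const key = stemWord(word, doc.language);
      if (!index.has(key)) index.set(key, []);
      index.get(key).push(doc._id);
    }
  }
  return index;
}

// Stem the search term with the query language, then look the stem up.
function searchStemOnly(index, term, language) {
  return index.get(stemWord(term, language)) || [];
}

const stemOnlyIndex = buildStemOnlyIndex([
  { _id: 1, text: "hello world", language: "english" },
]);

console.log(searchStemOnly(stemOnlyIndex, "hello", "english")); // [ 1 ]
console.log(searchStemOnly(stemOnlyIndex, "hello", "spanish")); // []
console.log(searchStemOnly(stemOnlyIndex, "hello", "russian")); // [ 1 ]
```

Under these assumptions the sketch reproduces the three results in the description: the English and Russian queries both look up "hello", while the Spanish query looks up "hell" and finds nothing.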

Comment by J Rassi [ 15/Mar/13 ]

I'm sorry, I accidentally removed your reproduction of the issue when editing the ticket (JIRA automatically deletes the "Steps to Reproduce" field when "Issue Type" is changed from "Bug" to "Improvement"). I've added back a representative case to the "Description" field.

Comment by J Rassi [ 15/Mar/13 ]

This is a limitation of the current implementation: when terms are added to the index, language information is not persisted. Thus, for example, whether you see an English result in a non-English search is not defined; it depends on the implementation behavior of the two relevant stemming algorithms.

I've edited the title of this ticket to turn it into a feature request. We may address this in a future release.
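
As a rough illustration of what partitioning index entries by language could mean, the sketch below stores the language together with the stem in each index key. This is only one possible reading of the request, not a design the server is known to use. The stem table is hard-coded from the values quoted elsewhere in this ticket, words without a listed stem are assumed to stem to themselves, and the names STEMS_FROM_TICKET, stemFor, buildPartitionedIndex and searchPartitioned are hypothetical.

```javascript
// Stems of "hello" as quoted in this ticket (hard-coded, not computed).
const STEMS_FROM_TICKET = {
  english: { hello: "hello" },
  spanish: { hello: "hell" },
  russian: { hello: "hello" },
};

// Assumption: any word not listed above stems to itself.
function stemFor(word, language) {
  const table = STEMS_FROM_TICKET[language] || {};
  return Object.prototype.hasOwnProperty.call(table, word) ? table[word] : word;
}

// Hypothetical key layout: language and stem are stored together, so
// identical stems from different languages land in different partitions.
function partitionKey(language, stem) {
  return language + "\u0000" + stem;
}

function buildPartitionedIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const word of doc.text.split(/\s+/)) {
      const key = partitionKey(doc.language, stemFor(word, doc.language));
      if (!index.has(key)) index.set(key, []);
      index.get(key).push(doc._id);
    }
  }
  return index;
}

// Only entries indexed under the query language can match.
function searchPartitioned(index, term, language) {
  return index.get(partitionKey(language, stemFor(term, language))) || [];
}

const partitionedIndex = buildPartitionedIndex([
  { _id: 1, text: "hello world", language: "english" },
]);

console.log(searchPartitioned(partitionedIndex, "hello", "english")); // [ 1 ]
console.log(searchPartitioned(partitionedIndex, "hello", "spanish")); // []
console.log(searchPartitioned(partitionedIndex, "hello", "russian")); // []
```

With the language persisted in the key, the Russian query no longer returns the English document even though both stemmers produce "hello", so the result stops depending on how the two stemming algorithms happen to behave.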

Generated at Thu Feb 08 03:19:00 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.