[SERVER-13998] Support for language constrained search Created: 20/May/14  Updated: 28/Dec/23

Status: Backlog
Project: Core Server
Component/s: Index Maintenance, Text Search
Affects Version/s: 2.6.1
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: A Mare
Assignee: Backlog - Query Integration
Resolution: Unresolved
Votes: 0
Labels: qi-text-search
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-8988 Text indexes should partition index e... Open
Assigned Teams:
Query Integration
Participants:

 Description   

MongoDB text search became quite flexible in version 2.6 in terms of specifying the language of documents (and subdocuments). It is also possible (and advisable) to specify the language of the search terms when performing the search.

What is still missing is a way to restrict the returned documents by the original language of the matched stems. If a collection holds documents with text-indexed fields in various languages, searching for words in language A may very well return documents that matched the query through stems collected from words in language B (i.e. with a totally different meaning). There are quite a few cases where language separation is not only advisable, but required.

Example:
Pommes has different meanings in German ("French fries") and French ("apples"). If you look for

{$text: {$search: "pommes", $language: "fr"}}

the search will also return documents referring to French fries (matched through stems of the German word), which is not what was intended.

Suggestion:
Create a new boolean parameter, say $restrictLanguage, with a default value of false, which indicates that resulting documents must have the matching stems from the same language as the search words.

{$text: {$search: "pommes", $language: "fr", $restrictLanguage: true} }

Of course, this would imply that the text indexes will have to also store the original language information for all collected stems (actually, for all their occurrences).
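To illustrate the idea, here is a minimal toy sketch in plain JavaScript. It is not MongoDB's index format or its stemmer: `toyStem`, `buildIndex`, `search` and the `restrictLanguage` argument are hypothetical names used only for this illustration, and the stemmer is a crude placeholder. It only shows that, once each posting records the language of the word it came from, the proposed flag reduces to a filter on that language.

```javascript
// Toy model only: not MongoDB's index format or stemmer.
// Placeholder stemmer: lowercases and drops a trailing "es" or "s".
function toyStem(word) {
  return word.toLowerCase().replace(/(es|s)$/, "");
}

// Builds a map: stem -> list of { id, language } postings, one per occurrence.
function buildIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const word of doc.text.split(/\s+/)) {
      const stem = toyStem(word);
      if (!index.has(stem)) index.set(stem, []);
      index.get(stem).push({ id: doc.id, language: doc.language });
    }
  }
  return index;
}

// restrictLanguage mirrors the proposed (hypothetical) $restrictLanguage flag.
function search(index, term, language, restrictLanguage = false) {
  const postings = index.get(toyStem(term)) || [];
  const ids = postings
    .filter((p) => !restrictLanguage || p.language === language)
    .map((p) => p.id);
  return [...new Set(ids)]; // de-duplicate document ids
}

const docs = [
  { id: 1, language: "fr", text: "tarte aux pommes" },
  { id: 2, language: "de", text: "Pommes mit Ketchup" },
];
const index = buildIndex(docs);
const unrestricted = search(index, "pommes", "fr");     // both documents match
const restricted = search(index, "pommes", "fr", true); // only the French one
```

Without the flag, both documents match because they share the same stem; with it, only the document whose matching stem was indexed as French remains.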

As a side note, this would bring you closer to Google's web search option to look only at pages in a certain language.

Current workaround:
Developers must split the text content into separate collections per language and manually maintain the relationship/synchronization with the original collection (yes, this is the drawback of non-relational databases!). This separation is necessary to work around the constraint that no more than one text index can be defined for a collection...
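A rough sketch of how the query side of this workaround might look, assuming a hypothetical naming scheme of one text collection per language (for example `articles_text_fr`). The helper names `textCollectionFor` and `buildLanguageSearch` are made up for this illustration; only `$text`, `$search` and `$language` are actual query operators.

```javascript
// Hypothetical naming scheme: one text collection per language,
// e.g. "articles_text_fr" for French content of "articles".
function textCollectionFor(baseName, language) {
  if (!/^[a-z]+$/.test(language)) {
    throw new Error("unexpected language code: " + language);
  }
  return baseName + "_text_" + language;
}

// Returns the collection to query and the $text filter for one language.
function buildLanguageSearch(baseName, terms, language) {
  return {
    collection: textCollectionFor(baseName, language),
    filter: { $text: { $search: terms, $language: language } },
  };
}

const plan = buildLanguageSearch("articles", "pommes", "fr");
```

With a driver, the plan could then be run as something like `db.collection(plan.collection).find(plan.filter)`. Keeping the per-language collections in sync with the original collection remains the application's job.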



 Comments   
Comment by A Mare [ 20/May/14 ]

Well, I do not see any duplication here. SERVER-8988 is about getting different results depending on the $language specified along with the search words, which is quite normal in my opinion (to be honest, I don't see the point of SERVER-8988).

I'm talking here about a completely different issue, one that does indeed need language information attached to the indexed stems.

Generated at Thu Feb 08 03:33:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.