Core Server / SERVER-13998

Support for language constrained search

    • Query Integration

      As of version 2.6, MongoDB text search is quite flexible about specifying the language of documents (and subdocuments). It is also possible (and advisable) to specify the language of the search terms when running a query.

      What is still missing is a way to restrict the results by the original language of the matched stems. If a collection holds documents with text-indexed fields in various languages, searching for words in language A may very well return documents that matched the query through stems collected from words in language B (that is, words with a totally different meaning). There are quite a few cases where language separation is not only advisable but required.

      Example:
      "Pommes" means "French fries" in German and "apples" in French. If you search for

      {$text: {$search: "pommes", $language: "fr"}}

      the search will also return German documents about French fries, which is not what was intended.
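
      The collision can be reproduced outside the server with a small simulation. This is only a sketch: toyStem is a made-up stand-in for the real stemmers, the two sample documents are hypothetical, and the index layout is an assumption chosen to show the effect, not a description of MongoDB internals.

```javascript
// Toy stemmer, NOT the real Snowball algorithm: strips a trailing "es" or "s".
function toyStem(word) {
  return word.toLowerCase().replace(/(es|s)$/, "");
}

// Hypothetical sample documents; `language` mirrors the per-document language field.
const docs = [
  { _id: 1, language: "fr", text: "tarte aux pommes" },
  { _id: 2, language: "de", text: "Pommes mit Ketchup" },
];

// Build one shared index: stem -> set of document ids, with no language recorded.
function buildIndex(documents) {
  const index = new Map();
  for (const doc of documents) {
    for (const word of doc.text.split(/\s+/)) {
      const stem = toyStem(word);
      if (!index.has(stem)) index.set(stem, new Set());
      index.get(stem).add(doc._id);
    }
  }
  return index;
}

// Look up the stem of the search term; nothing in the index can filter by language.
function search(index, term) {
  const ids = index.get(toyStem(term)) || new Set();
  return [...ids].sort((a, b) => a - b);
}

const index = buildIndex(docs);
const hits = search(index, "pommes");
console.log(hits); // both the French and the German document match
```

      Because both words reduce to the same stem in this toy setup, the German document is returned alongside the French one.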

      Suggestion:
      Create a new boolean parameter, say $restrictLanguage, defaulting to false. When true, a document is returned only if its matching stems come from the same language as the search terms.

      {$text: {$search: "pommes", $language: "fr", $restrictLanguage: true} }

      Of course, this implies that text indexes would also have to store the original language of every collected stem (in fact, of every occurrence).
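
      A rough sketch of what this could look like, again as a standalone simulation rather than server code. Everything here is an assumption for illustration: toyStem is a made-up stemmer, the posting layout { id, language } is hypothetical, and restrictLanguage simply mirrors the proposed $restrictLanguage parameter, which does not exist in MongoDB.

```javascript
// Toy stemmer, NOT the real Snowball algorithm: strips a trailing "es" or "s".
function toyStem(word) {
  return word.toLowerCase().replace(/(es|s)$/, "");
}

// Hypothetical sample documents in two languages.
const docs = [
  { _id: 1, language: "fr", text: "tarte aux pommes" },
  { _id: 2, language: "de", text: "Pommes mit Ketchup" },
];

// Each posting keeps the language of the word it came from:
// stem -> array of { id, language }.
function buildLanguageAwareIndex(documents) {
  const index = new Map();
  for (const doc of documents) {
    for (const word of doc.text.split(/\s+/)) {
      const stem = toyStem(word);
      if (!index.has(stem)) index.set(stem, []);
      index.get(stem).push({ id: doc._id, language: doc.language });
    }
  }
  return index;
}

// restrictLanguage defaults to false, as in the proposal; when true, only
// postings whose language equals the query language are kept.
function searchText(index, { search, language, restrictLanguage = false }) {
  const postings = index.get(toyStem(search)) || [];
  const ids = postings
    .filter((p) => !restrictLanguage || p.language === language)
    .map((p) => p.id);
  return [...new Set(ids)].sort((a, b) => a - b);
}

const langIndex = buildLanguageAwareIndex(docs);
const unrestricted = searchText(langIndex, { search: "pommes", language: "fr" });
const restricted = searchText(langIndex, {
  search: "pommes",
  language: "fr",
  restrictLanguage: true,
});
console.log(unrestricted, restricted); // two matches without the flag, one with it
```

      Storing the language per posting rather than per stem is what allows the same stem to be matched or skipped depending on where it came from.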

      As a side note, this would bring MongoDB closer to Google's web search option of restricting results to pages in a given language.

      Current workaround:
      Developers must split all text content into separate collections per language and manually maintain the relationship and synchronization with the original collection (yes, this is the drawback of non-relational databases!). The split is needed to work around the constraint that a collection can have at most one text index...
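
      The routing part of this workaround might look roughly like the following. The naming scheme, the sourceId back-reference, and all helper names are hypothetical; only the $text, $search, and $language operators come from MongoDB itself. The helpers only build names and filters, so the actual driver calls and the synchronization logic are left out.

```javascript
// Hypothetical naming scheme: one shadow collection per language,
// e.g. "articles_text_fr" for French text taken from "articles".
function textCollectionName(baseCollection, language) {
  return `${baseCollection}_text_${language}`;
}

// Shadow document stored in the per-language collection. `sourceId` is an
// assumed field that points back to the original document and must be kept
// in sync by application code.
function toShadowDocument(doc) {
  return { sourceId: doc._id, text: doc.text };
}

// Describe the search to run: which collection to target and which filter to use.
function buildLanguageSearch(baseCollection, term, language) {
  return {
    collection: textCollectionName(baseCollection, language),
    filter: { $text: { $search: term, $language: language } },
  };
}

const plan = buildLanguageSearch("articles", "pommes", "fr");
console.log(JSON.stringify(plan));
```

      Each per-language collection then carries its own text index, so a search routed this way only ever sees stems from one language.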

            Assignee: [DO NOT USE] Backlog - Query Integration (backlog-query-integration)
            Reporter: A Mare (amare)
            Votes: 0
            Watchers: 4
