Core Server / SERVER-9390

Multi-language support for text search

    • Type: New Feature
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 2.5.3
    • Affects Version/s: 2.4.0
    • Component/s: Text Search
    • Labels: None

      Text Search - Multi-Language Documents

      Objective

      The objective is for full-text search (FTS) to support documents containing multiple languages. This means that the correct stemmer and stopword list are applied to each subdocument, according to that subdocument's language specification.

      External Specification

      We allow documents to contain nested “language” specifiers. The same override field name applies at every level in a document; that is, you can use a name other than “language”, but you have to use the same name everywhere. The format is minimal:

      {
        ...
        language: "spanish",
        ...
        subdoc: { language: "portuguese", ... },
        ...
      }

      Implementation

      All the work of generating scored terms for the FTS index is concentrated in the method FTSSpec::scoreDocument. The current algorithm determines the language binding from the value of the field “language” (or the override) in the given document:

      BSONElement e = userDoc[_languageOverrideField];

      => e.valuestrsafe()

      The algorithm then loops over the index field specifiers and extracts language-specific (term, score) pairs for the FTS indexer plugin.

      In order to make this work with subdocument language specifiers, we need to allow every field in the weights vector (index field specifiers) to have a possibly different associated language as determined by the given document. In the loop (in FTSSpec::scoreDocument):

      Weights::const_iterator i;
      for ( i = _weights.begin(); i != _weights.end(); i++ ) {
          // name of field
          const char * leftOverName = i->first.c_str();
          BSONElement e = obj.getFieldDottedOrArray(leftOverName);
          ...

      We could check for a local language specifier in the element e, working outward through progressively shorter prefixes of the dotted name. If the name is not dotted, no language change occurs.
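
The prefix walk described above might look roughly like the following. This is a minimal sketch over a toy document model, not the server's BSON classes: `Field`, `findField`, `resolveSubdoc`, and `languageFor` are hypothetical names introduced only for illustration, and the fallback `defaultLanguage` is an assumption about what happens when no specifier is found at any level.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a BSON element (hypothetical, for illustration only).
struct Field {
    std::string name;
    std::string value;            // used when the field holds a string
    std::vector<Field> children;  // non-empty when the field is a subdocument
};

// Linear scan for a direct child by name; nullptr if absent.
const Field* findField(const std::vector<Field>& fields, const std::string& name) {
    for (const Field& f : fields)
        if (f.name == name) return &f;
    return nullptr;
}

// Resolve a dotted path ("a.b") to that subdocument's fields.
// The empty path resolves to the root; nullptr if the path does not exist.
const std::vector<Field>* resolveSubdoc(const std::vector<Field>& root,
                                        const std::string& path) {
    const std::vector<Field>* cur = &root;
    std::size_t start = 0;
    while (start < path.size()) {
        std::size_t dot = path.find('.', start);
        if (dot == std::string::npos) dot = path.size();
        const Field* f = findField(*cur, path.substr(start, dot - start));
        if (f == nullptr) return nullptr;
        cur = &f->children;
        start = dot + 1;
    }
    return cur;
}

// Work outward through progressively shorter prefixes of the dotted name and
// return the first language specifier found; the innermost one wins.
std::string languageFor(const std::vector<Field>& root,
                        const std::string& dottedName,
                        const std::string& overrideField,
                        const std::string& defaultLanguage) {
    assert(!overrideField.empty());
    std::string prefix = dottedName;
    while (true) {
        // Drop the last path component: "a.b.c" -> "a.b" -> "a" -> "".
        std::size_t dot = prefix.rfind('.');
        prefix = (dot == std::string::npos) ? std::string() : prefix.substr(0, dot);
        if (const std::vector<Field>* sub = resolveSubdoc(root, prefix)) {
            if (const Field* lang = findField(*sub, overrideField)) return lang->value;
        }
        if (prefix.empty()) return defaultLanguage;  // assumed fallback
    }
}
```

For the spanish/portuguese document shown earlier, `languageFor(doc, "subdoc.text", "language", "english")` would resolve to the subdocument's own specifier, while a non-dotted name such as `"title"` falls through to the top-level one. Each prefix is re-resolved from the root, so the cost grows with the nesting depth of the indexed field.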

      Alternatively, we could rewrite the loop to descend recursively into the given document, stacking “language” field values and checking index field names against a quick lookup table. Such a reimplementation would be efficient, since in the worst case it requires exactly one pass through the given document.
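
A recursive descent of this kind could be sketched as below, again over a toy document model rather than the real BSON types. `Field`, `Bindings`, and `collectBindings` are hypothetical names, and the call stack plays the role of the language stack. Because a specifier may appear after the text fields at the same level, this sketch defers emitting a level's fields until that level has been scanned; it also assumes indexed fields are string-valued leaves.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Toy stand-in for a BSON element (hypothetical, for illustration only).
struct Field {
    std::string name;
    std::string value;            // used when the field holds a string
    std::vector<Field> children;  // non-empty when the field is a subdocument
};

// (dotted field name, language bound to it)
using Bindings = std::vector<std::pair<std::string, std::string>>;

// Visit every field once, carrying the inherited language down the recursion.
void collectBindings(const std::vector<Field>& fields,
                     const std::string& pathPrefix,
                     const std::string& inheritedLanguage,
                     const std::string& overrideField,
                     const std::unordered_set<std::string>& indexedFields,
                     Bindings& out) {
    assert(!overrideField.empty());
    std::string language = inheritedLanguage;
    std::vector<const Field*> textFields;  // indexed leaves at this level
    std::vector<const Field*> subdocs;     // nested documents to descend into

    for (const Field& f : fields) {
        if (f.name == overrideField) {
            language = f.value;  // local specifier overrides the inherited one
        } else if (!f.children.empty()) {
            subdocs.push_back(&f);
        } else if (indexedFields.count(pathPrefix + f.name) > 0) {
            textFields.push_back(&f);  // O(1) expected lookup in the hash set
        }
    }

    // The level's language is now known, so bind it to the deferred fields.
    for (const Field* f : textFields)
        out.emplace_back(pathPrefix + f->name, language);
    for (const Field* f : subdocs)
        collectBindings(f->children, pathPrefix + f->name + ".", language,
                        overrideField, indexedFields, out);
}
```

Calling `collectBindings(doc, "", "english", "language", indexed, out)` fills `out` with one (field, language) pair per indexed field present in the document; the scorer could then stem and score each field with its bound language.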

      The implementation of operator[](const char* field) in BSONObj iterates across the BSON object and returns the first matching field. Model the m text fields as randomly located in the document, with the i-th field at iteration position f_i. The existing implementation of FTSSpec::scoreDocument then requires sum_{i = 1..m} f_i steps. A single traversal of the document checking for text fields would require m + max_i(f_i) steps (assuming that checking the text field set is O(1) and that creating the hashtable takes m steps).
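
To make the comparison concrete, the two step counts can be computed directly from the positions f_i. The helper names below are hypothetical, and the counts follow only the simplified model stated above (1-based positions, O(1) set membership, m steps to build the table).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One operator[]-style lookup per text field: lookup i scans f_i elements,
// so the total is the sum of all positions.
std::size_t repeatedLookupSteps(const std::vector<std::size_t>& positions) {
    std::size_t steps = 0;
    for (std::size_t f : positions) {
        assert(f >= 1);  // positions are 1-based iteration positions
        steps += f;
    }
    return steps;
}

// Single traversal: m steps to build the lookup table, then one scan that
// can stop at the last text field, i.e. after max(f_i) elements.
std::size_t singleTraversalSteps(const std::vector<std::size_t>& positions) {
    if (positions.empty()) return 0;
    return positions.size() +
           *std::max_element(positions.begin(), positions.end());
}
```

With three text fields at positions 3, 10, and 20, the repeated lookups cost 3 + 10 + 20 = 33 steps, while the single traversal costs 3 + 20 = 23. With a single text field the traversal is one step more expensive (m + f_1 versus f_1), so the benefit appears only as m grows.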

            Assignee: rassi J Rassi
            Reporter: paul.pedersen Paul Pedersen
            Votes: 0
            Watchers: 7
