- Type: New Feature
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: 2.4.0
- Component/s: Text Search
Text Search - Multi-Language Documents
Objective
The objective is for FTS to support documents containing multiple languages. This means that correct stemmers and stopword lists are applied depending on the language specification on a per-subdocument basis.
External Specification
We allow documents to contain nested “language” specifiers. The same override field name applies at every level in a document; i.e., you can use a name other than “language”, but you have to use the same name everywhere. The format is minimal:
{ ...
  language: "spanish",
  ...
  subdoc :
    { language : "portuguese", ... },
  ...
}
Implementation
All the work of generating scored terms for the FTS index is concentrated in the method FTSSpec::scoreDocument. The current algorithm determines the language binding from the value of the field “language” (or the override) in the given document:
BSONElement e = userDoc[_languageOverrideField];
=> e.valuestrsafe()
The algorithm then loops over the index field specifiers and extracts language-specific (term, score) pairs for the FTS indexer plugin.
In order to make this work with subdocument language specifiers, we need to allow every field in the weights vector (index field specifiers) to have a possibly different associated language as determined by the given document. In the loop (in FTSSpec::scoreDocument):
Weights::const_iterator i;
for ( i = _weights.begin(); i != _weights.end(); i++ ) {
// name of field
const char * leftOverName = i->first.c_str();
BSONElement e = obj.getFieldDottedOrArray(leftOverName);
...
We could check for a local language specifier in the element e, working outward through progressively shorter prefixes of the dotted name. If the name is not dotted, no language change occurs.
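The prefix-based lookup described above might look roughly like the following. This is only a sketch under stated assumptions: `Doc` is a hypothetical stand-in for a BSON document (string fields plus nested subdocuments), and `resolveSubdoc` and `languageForField` are illustrative names, not functions in the server code. For a field name such as "subdoc.text" it tries the enclosing subdocument "subdoc" first and then the top-level document.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a BSON document (not the real BSONObj API).
struct Doc {
    std::map<std::string, std::string> strings;           // string-valued fields
    std::map<std::string, std::shared_ptr<Doc>> subdocs;  // nested subdocuments
};

// Resolve a dotted path such as "a.b" to a subdocument.
// The empty path resolves to the root; a missing component yields nullptr.
const Doc* resolveSubdoc(const Doc& root, const std::string& path) {
    const Doc* cur = &root;
    std::size_t start = 0;
    while (start < path.size()) {
        const std::size_t dot = path.find('.', start);
        const std::size_t len =
            (dot == std::string::npos) ? std::string::npos : dot - start;
        auto it = cur->subdocs.find(path.substr(start, len));
        if (it == cur->subdocs.end()) return nullptr;
        cur = it->second.get();
        if (dot == std::string::npos) break;
        start = dot + 1;
    }
    return cur;
}

// Work outward through progressively shorter prefixes of the dotted name and
// return the first language specifier found, else the default language.
std::string languageForField(const Doc& root,
                             const std::string& dottedName,
                             const std::string& overrideField,
                             const std::string& defaultLanguage) {
    std::string prefix = dottedName;
    while (true) {
        // Drop the last path component; an undotted name leaves the root only.
        const std::size_t dot = prefix.rfind('.');
        prefix = (dot == std::string::npos) ? std::string() : prefix.substr(0, dot);
        if (const Doc* sub = resolveSubdoc(root, prefix)) {
            auto it = sub->strings.find(overrideField);
            if (it != sub->strings.end()) return it->second;
        }
        if (prefix.empty()) break;
    }
    return defaultLanguage;
}
```

Note that each prefix is resolved from the root again, so deeply dotted names cost repeated lookups; that cost is what motivates the single-pass alternative.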
Alternatively, we could rewrite the loop and descend recursively into the given document, stacking “language” field values and checking index field names against a quick lookup table. Such a reimplementation would be efficient, since in the worst case it requires exactly one pass through the given document.
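A minimal sketch of the recursive variant, again under assumptions: `Node` is a hypothetical stand-in for a BSON document, `collect` is an illustrative name, and the "stack" of language values is simply the call stack (each level passes its effective language down). The set of indexed field names plays the role of the quick lookup table.

```cpp
#include <map>
#include <memory>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for a BSON document (not the real BSONObj API).
struct Node {
    std::map<std::string, std::string> strings;             // string-valued fields
    std::map<std::string, std::shared_ptr<Node>> children;  // nested subdocuments
};

// (dotted field name, effective language) pairs for the indexed fields found.
using FieldLanguages = std::vector<std::pair<std::string, std::string>>;

void collect(const Node& node,
             const std::string& path,       // dotted path of this subdocument
             const std::string& inherited,  // language in effect in the parent
             const std::string& overrideField,
             const std::unordered_set<std::string>& indexedFields,
             FieldLanguages& out) {
    // A local specifier overrides the inherited language for this subtree.
    // Assumption: the specifier can be looked up before the other fields;
    // a streaming BSON pass would have to handle it appearing later.
    std::string language = inherited;
    auto spec = node.strings.find(overrideField);
    if (spec != node.strings.end()) language = spec->second;

    // Record every indexed string field at this level.
    for (const auto& kv : node.strings) {
        const std::string full = path.empty() ? kv.first : path + "." + kv.first;
        if (indexedFields.count(full)) out.emplace_back(full, language);
    }
    // Descend into subdocuments, carrying the current language.
    for (const auto& kv : node.children) {
        const std::string full = path.empty() ? kv.first : path + "." + kv.first;
        collect(*kv.second, full, language, overrideField, indexedFields, out);
    }
}
```

Calling `collect(root, "", "english", "language", indexed, out)` visits each field once, so the work is proportional to the document size rather than to the number of indexed fields times the document size.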
The implementation of operator[](const char* field) in BSONObj iterates across the BSON object and returns the first matching field. Model the m text fields as randomly located in the document, with the i-th field at iteration position f_i. The existing implementation of FTSSpec::scoreDocument then requires $\sum_{i=1}^{m} f_i$ steps. A single traversal of the document that checks for text fields would require $m + \max_i f_i$ steps (assuming that checking membership in the text field set is O(1), and that it takes m steps to create the hashtable).
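To make the comparison concrete, the two step counts can be computed directly from the field positions. The function names below are illustrative only, and the model is the simplified one from the text (positions f_i, O(1) set membership, m steps to build the table).

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Repeated first-match lookups: one scan up to position f_i per text field,
// for a total of f_1 + f_2 + ... + f_m steps.
std::size_t repeatedLookupSteps(const std::vector<std::size_t>& positions) {
    return std::accumulate(positions.begin(), positions.end(), std::size_t{0});
}

// Single traversal: m steps to build the lookup table plus one scan that
// stops at the last text field, i.e. m + max(f_i) steps.
std::size_t singleTraversalSteps(const std::vector<std::size_t>& positions) {
    if (positions.empty()) return 0;
    return positions.size() +
           *std::max_element(positions.begin(), positions.end());
}
```

For example, with three text fields at positions 3, 10 and 25, the repeated lookups cost 3 + 10 + 25 = 38 steps, while the single traversal costs 3 + 25 = 28.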
- has to be done before
  - SERVER-10906 Support for legacy text index format textIndexVersion:1 (Closed)
- is duplicated by
  - SERVER-8137 Text indexing/search with embedded language support (Closed)