Type: New Feature
Resolution: Done
Priority: Major - P3
Fix Version/s: 2.5.3
Affects Version/s: 2.4.0
Component/s: Text Search
Labels: None

Text Search - Multi-Language Documents

Objective

The objective is for FTS to support documents containing multiple languages. This means that correct stemmers and stopword lists are applied depending on the language specification on a per-subdocument basis.

External Specification

We allow documents to contain nested "language" specifiers. The same override field name applies at every level in a document; that is, you can use a name other than "language", but you have to use the same name everywhere. The format is minimal:

{ ...
  language: "spanish",
  ...
  subdoc: { language: "portuguese", ... }
  ...
}

Implementation

All the work of generating scored terms for the FTS index is concentrated in the method FTSSpec::scoreDocument. The current algorithm determines the language binding from the value of the field “language” (or the override) in the given document:

BSONElement e = userDoc[_languageOverrideField];

=> e.valuestrsafe()

Then the algorithm loops over the index field specifiers and extracts language-specific (term, score) pairs for the FTS indexer plugin.

In order to make this work with subdocument language specifiers, we need to allow every field in the weights vector (index field specifiers) to have a possibly different associated language as determined by the given document. In the loop (in FTSSpec::scoreDocument):

Weights::const_iterator i;
for ( i = _weights.begin(); i != _weights.end(); i++ ) {
    // name of field
    const char * leftOverName = i->first.c_str();
    BSONElement e = obj.getFieldDottedOrArray(leftOverName);
    ...

We could check for a local language specifier in the element e, working outward through progressively shorter prefixes of the dotted name. If the name is not dotted, no language change occurs.
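The outward prefix walk described above might look roughly like the following sketch. It is not the server code: Node, findChild, resolve, and languageFor are hypothetical names, Node is a toy stand-in for a BSON document, and arrays are ignored for brevity.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a BSON document (hypothetical; not the server's BSONObj).
struct Node {
    std::string name;
    std::string text;            // string value when this is a leaf
    std::vector<Node> children;  // fields when this is a subdocument
};

// Returns the direct child named `key`, or nullptr if there is none.
const Node* findChild(const Node& doc, const std::string& key) {
    for (const Node& c : doc.children)
        if (c.name == key) return &c;
    return nullptr;
}

// Resolves a dotted path from `root`; nullptr if any component is missing.
const Node* resolve(const Node& root, const std::string& path) {
    const Node* cur = &root;
    std::size_t start = 0;
    while (cur && start <= path.size()) {
        std::size_t dot = path.find('.', start);
        std::string part = path.substr(
            start, dot == std::string::npos ? std::string::npos : dot - start);
        cur = findChild(*cur, part);
        if (dot == std::string::npos) break;
        start = dot + 1;
    }
    return cur;
}

// Finds the language for the field at `path` by checking the enclosing
// subdocument first, then progressively shorter prefixes, then the root.
std::string languageFor(const Node& root, std::string path,
                        const std::string& overrideField,
                        const std::string& defaultLanguage) {
    for (;;) {
        std::size_t dot = path.rfind('.');
        // The enclosing document is the root once no dot remains.
        const Node* enclosing = (dot == std::string::npos)
                                    ? &root
                                    : resolve(root, path.substr(0, dot));
        if (enclosing) {
            if (const Node* lang = findChild(*enclosing, overrideField))
                return lang->text;
        }
        if (dot == std::string::npos) return defaultLanguage;
        path.erase(dot);  // drop the last component and try one level out
    }
}
```

With a document shaped like the example in the External Specification, languageFor(doc, "subdoc.body", "language", "english") would yield "portuguese", while an undotted name such as "title" falls back to the top-level "spanish". Note that each prefix is resolved from the root again, which is the repeated scanning the alternative approach tries to avoid.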

Alternatively, we could rewrite the loop and descend recursively into the given document, stacking "language" field values and checking index field names against a quick lookup table. Such a reimplementation would be efficient, since in the worst case it requires exactly one pass through the given document.
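A minimal sketch of the recursive alternative, under assumptions: Node is a toy stand-in for a BSON document, collect is a hypothetical name, the recursion's call stack plays the role of the language stack, and the lookup table is a std::unordered_set of dotted index field names. For simplicity this sketch scans each level twice (once to find the language specifier, once to visit the fields), so it is a bounded number of visits per field rather than strictly one.

```cpp
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Toy stand-in for a BSON document (hypothetical; not the server's BSONObj).
struct Node {
    std::string name;
    std::string text;            // string value when this is a leaf
    std::vector<Node> children;  // fields when this is a subdocument
};

// Appends a (dotted path, language) pair to `out` for every indexed leaf
// field found under `doc`, in document order.
void collect(const Node& doc, const std::string& prefix,
             const std::string& inherited,
             const std::string& overrideField,
             const std::unordered_set<std::string>& indexed,
             std::vector<std::pair<std::string, std::string>>& out) {
    // A specifier at this level applies to the whole level regardless of
    // field order, so look for it before visiting the other fields.
    std::string lang = inherited;
    for (const Node& c : doc.children)
        if (c.name == overrideField && c.children.empty()) lang = c.text;

    for (const Node& c : doc.children) {
        std::string path = prefix.empty() ? c.name : prefix + "." + c.name;
        if (!c.children.empty())
            collect(c, path, lang, overrideField, indexed, out);  // push language
        else if (indexed.count(path))
            out.emplace_back(path, lang);
    }
}
```

Passing the current language down as an argument means a subdocument without its own specifier simply inherits the enclosing one, and returning from the call restores the outer language without any explicit pop.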

The implementation of operator[](const char* field) in BSONObj iterates across the BSON object and returns the first matching field. Model the m text fields as randomly located in the document, with the i-th field at iteration position f_i. The existing implementation of FTSSpec::scoreDocument then requires sum_{i = 1..m} f_i steps. A single traversal of the document checking for text fields would require m + max(f_i) steps (assuming that checking membership in the text field set is O(1), and that it takes m steps to create the hashtable).
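To make the comparison concrete, the two step counts can be computed for a given set of field positions. This is only an illustration of the cost model above; the function names are hypothetical and the positions are made-up inputs, not measurements.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Repeated lookups: finding field i scans f_i elements, so the total is
// the sum of all positions.
std::size_t repeatedLookupSteps(const std::vector<std::size_t>& positions) {
    return std::accumulate(positions.begin(), positions.end(), std::size_t{0});
}

// Single traversal: m steps to build the lookup table, plus one scan that
// can stop at the last text field, i.e. at max(f_i).
std::size_t singleTraversalSteps(const std::vector<std::size_t>& positions) {
    if (positions.empty()) return 0;
    return positions.size() +
           *std::max_element(positions.begin(), positions.end());
}
```

For three text fields at positions 10, 20, and 30, the repeated-lookup model gives 60 steps and the single-traversal model gives 33. The gap widens as m grows, since the sum grows with every field while the traversal cost is bounded by the document length plus m.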

has to be done before

SERVER-10906 Support for legacy text index format textIndexVersion:1

Closed

is depended on by

DOCS-1937 Document : Multi-language support for text search

Closed

is duplicated by

SERVER-8137 Text indexing/search with embedded language support

Closed

Assignee: J Rassi

Reporter: Paul Pedersen

Participants: auto, J Rassi, Paul Pedersen, Shane R. Spencer, Will Shaver

Votes: 0

Watchers: 7

Created: Apr 18 2013 04:25:27 PM UTC

Updated: Feb 07 2014 11:18:58 PM UTC

Resolved: Oct 12 2013 05:10:35 AM UTC
