[SERVER-9390] Multi-language support for text search Created: 18/Apr/13  Updated: 07/Feb/14  Resolved: 12/Oct/13

Status: Closed
Project: Core Server
Component/s: Text Search
Affects Version/s: 2.4.0
Fix Version/s: 2.5.3

Type: New Feature Priority: Major - P3
Reporter: Paul Pedersen Assignee: J Rassi
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by DOCS-1937 Document : Multi-language support for... Closed
Duplicate
is duplicated by SERVER-8137 Text indexing/search with embedded la... Closed
Gantt Dependency
has to be done before SERVER-10906 Support for legacy text index format ... Closed

 Description   

Text Search - Multi-Language Documents

Objective

The objective is for FTS to support documents containing multiple languages. This means that correct stemmers and stopword lists are applied depending on the language specification on a per-subdocument basis.

External Specification

We allow documents to contain nested “language” specifiers. The same override field name applies at every level in a document; i.e., you can use a name other than “language”, but you have to use the same name everywhere. The format is minimal:

{ ...
  language: "spanish",
  ...
  subdoc: { language: "portuguese", ... },
  ...
}

Implementation

All the work of generating scored terms for the FTS index is concentrated in the method FTSSpec::scoreDocument. The current algorithm determines the language binding from the value of the field “language” (or the override) in the given document:

BSONElement e = userDoc[_languageOverrideField];

=> e.valuestrsafe()

Then the algorithm loops over the index field specifiers and extracts language-specific (term, score) pairs for the FTS indexer plugin.

In order to make this work with subdocument language specifiers, we need to allow every field in the weights vector (index field specifiers) to have a possibly different associated language as determined by the given document. In the loop (in FTSSpec::scoreDocument):

Weights::const_iterator i;
for ( i = _weights.begin(); i != _weights.end(); i++ ) {
    // name of field
    const char * leftOverName = i->first.c_str();
    BSONElement e = obj.getFieldDottedOrArray(leftOverName);
    ...

We could check for a local language specifier in the element e, working outward through progressively shorter prefixes of the dotted name. If the name is not dotted, no language change occurs.
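
The prefix-based lookup can be sketched as follows. This is only an illustration under stated assumptions: Node, Field, leaf, doc, child and languageForPath are hypothetical names, the toy document model is not the server's BSONObj API, and arrays are ignored. The sketch walks inward along the dotted name and lets the innermost specifier win, which gives the same binding as checking outward from the longest prefix.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a BSON document (not the server's BSONObj API):
// an ordered list of named fields whose values are strings or subdocuments.
struct Field;
struct Node {
    bool isDoc = false;         // true: subdocument, false: string leaf
    std::string str;            // leaf value when !isDoc
    std::vector<Field> fields;  // child fields when isDoc
};
struct Field {
    std::string name;
    Node value;
};

Node leaf(std::string s) { Node n; n.str = std::move(s); return n; }
Node doc(std::vector<Field> fs) { Node n; n.isDoc = true; n.fields = std::move(fs); return n; }

// First field with the given name, or nullptr if there is none.
const Node* child(const Node& d, const std::string& name) {
    for (const Field& f : d.fields)
        if (f.name == name) return &f.value;
    return nullptr;
}

// Language that applies to the field named by a dotted path. Every
// subdocument on the way down (the root included) may rebind the language
// through a string-valued override field; the innermost binding wins.
std::string languageForPath(const Node& root, const std::string& path,
                            const std::string& overrideField,
                            const std::string& defaultLanguage) {
    std::string language = defaultLanguage;
    const Node* cur = &root;
    std::size_t start = 0;
    while (cur != nullptr && cur->isDoc) {
        const Node* spec = child(*cur, overrideField);
        if (spec != nullptr && !spec->isDoc) language = spec->str;

        // Step into the next path component; stop once the last one is reached.
        const std::size_t dot = path.find('.', start);
        const std::string component =
            path.substr(start, dot == std::string::npos ? std::string::npos : dot - start);
        cur = child(*cur, component);
        if (dot == std::string::npos) break;
        start = dot + 1;
    }
    return language;
}
```

For a document shaped like the one in the External Specification, with a hypothetical text field "body" inside subdoc, languageForPath(d, "subdoc.body", "language", "english") resolves to "portuguese", while an undotted name such as "title" keeps the top-level "spanish".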

Alternatively, we could rewrite the loop and descend recursively into the given document, stacking “language” field values and checking index field names against a quick lookup table. Such a reimplementation would be efficient, since it requires, in the worst case, exactly one pass through the given document.
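
A minimal sketch of the recursive alternative, again over a hypothetical toy document model rather than BSONObj (Node, Field, TextHit and descend are illustrative names, and arrays are ignored). The call stack plays the role of the language stack, and a hash set of dotted names stands in for the lookup table. Note that this sketch reads the override field of each subdocument before visiting its other fields, so it is not a strict single pass.

```cpp
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for a BSON document (not the server's BSONObj API).
struct Field;
struct Node {
    bool isDoc = false;         // true: subdocument, false: string leaf
    std::string str;            // leaf value when !isDoc
    std::vector<Field> fields;  // child fields when isDoc
};
struct Field {
    std::string name;
    Node value;
};

Node leaf(std::string s) { Node n; n.str = std::move(s); return n; }
Node doc(std::vector<Field> fs) { Node n; n.isDoc = true; n.fields = std::move(fs); return n; }

// One indexed text field together with the language bound to it.
struct TextHit {
    std::string path;
    std::string language;
    std::string text;
};

// Descend into the document. `language` is passed by value, so each
// recursion level carries its own binding and the enclosing binding is
// restored automatically when the call returns.
void descend(const Node& d, const std::string& prefix, std::string language,
             const std::unordered_set<std::string>& indexed,
             const std::string& overrideField, std::vector<TextHit>& out) {
    // A string-valued override field rebinds the language for this level.
    for (const Field& f : d.fields)
        if (f.name == overrideField && !f.value.isDoc) language = f.value.str;

    for (const Field& f : d.fields) {
        const std::string path = prefix.empty() ? f.name : prefix + "." + f.name;
        if (f.value.isDoc) {
            descend(f.value, path, language, indexed, overrideField, out);
        } else if (indexed.count(path) != 0) {
            out.push_back(TextHit{path, language, f.value.str});
        }
    }
}
```

Calling descend(d, "", "english", indexed, "language", hits) on the root collects every indexed field in document order, each paired with the language of its innermost enclosing specifier.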

The implementation of operator[](const char* field) in BSONObj iterates across the BSON object and returns the first matching field. Model the m text fields as randomly located in the document, with the i-th field at iteration position f_i. The existing implementation of FTSSpec::scoreDocument then requires $\sum_{i=1}^{m} f_i$ steps. A single traversal of the document checking for text fields would require $m + \max_i f_i$ steps (assuming that checking the text field set is O(1), and that it takes m steps to create a hashtable).
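
As a worked illustration of the two step counts, take three text fields at assumed 1-based positions 4, 7 and 10. The second block adds the expected costs under one reading of the random-placement model; the total field count n and the assumption that the m positions are distinct and drawn uniformly without replacement from 1..n are not in the original analysis.

```latex
% Assumed example: m = 3, f_1 = 4, f_2 = 7, f_3 = 10
\begin{align*}
\text{repeated lookups:} \quad & \sum_{i=1}^{m} f_i = 4 + 7 + 10 = 21 \\
\text{single traversal:} \quad & m + \max_i f_i = 3 + 10 = 13
\end{align*}
% Assumed model: m distinct positions drawn uniformly from 1..n, so that
% E[f_i] = (n+1)/2 and E[max_i f_i] = m(n+1)/(m+1)
\begin{align*}
E\Big[\sum_{i=1}^{m} f_i\Big] = \frac{m(n+1)}{2},
\qquad
E\Big[m + \max_i f_i\Big] = m + \frac{m(n+1)}{m+1}
\end{align*}
```

Under these assumptions the repeated-lookup cost grows linearly in m for a fixed document size, while the single-traversal cost stays below m + n + 1.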



 Comments   
Comment by auto [ 12/Oct/13 ]

Author: Jason Rassi (jrassi) <rassi@10gen.com>

Message: SERVER-9390 Bump textIndexVersion to 2
Branch: master
https://github.com/mongodb/mongo/commit/6a70c219c62990c3c350983fd57753a00d5dc69c

Comment by auto [ 12/Oct/13 ]

Author: Jason Rassi (jrassi) <rassi@10gen.com>

Message: SERVER-9390 Text search support for multi-language documents

FTSIndexFormat::getKeys() now descends into subdocuments to find
language field to apply to the given subdocument.
Branch: master
https://github.com/mongodb/mongo/commit/bf0f29709b19565245be370aa3f8c46f0332de91

Comment by Shane R. Spencer [ 15/May/13 ]

I'm looking forward to having the following scenario implemented to allow for heavily multi-language documents. I'm sure it's not a simple matter, however.

db.text.ensureIndex({
    'text.body': 'text'
}, {
    language_override: "text.language"
})
 
db.text.insert({
    'text': [{
            'language': 'english',
            'code': 'en_US',
            'body': 'howdy howdy howdy'
        }, {
            'language': 'spanish',
            'code': 'es_MX',
            'body': 'hola hola hola'
        }
    ]
})
 
db.text.runCommand( "text", { search: "hola", language: "spanish" } )
{
	"results" : [ ],
    ...
    "ok" : 1
}
 
db.text.runCommand( "text", { search: "hola" } )
{
	"results" : [
		{
			"score" : 1.75,
			"obj" : {
				"_id" : ObjectId("5193f38b5dcd189ac143047d"),
				"text" : [
					{
						"language" : "english",
						"code" : "en_US",
						"body" : "howdy howdy howdy"
					},
					{
						"language" : "spanish",
						"code" : "es_MX",
						"body" : "hola hola hola"
					}
				]
			}
		}
	],
    ....
    "ok" : 1
}

Of course the following would be a test case as well:

db.text.ensureIndex({
    'text.en_US.body': 'text',
    'text.es_MX.body': 'text'
}, {
    default_language: "english"
})
 
db.text.insert({
    'text': {
        'en_US': {
            'language': 'english',
            'body': 'howdy howdy howdy'
        },
        'es_MX': {
            'language': 'spanish',
            'body': 'hola hola hola'
        }
    }
})

Comment by Paul Pedersen [ 11/May/13 ]

thx. yes, i've implemented sub-document language support. i'll need to
consider how sub-document scoring might work.

-p

Comment by Will Shaver [ 10/May/13 ]

Paul -

When I was at MongoDB Days SF, I discussed the need for sub-document scores with Eliot Horowitz. He said I should create a ticket, since you were working on multi-language support for full-text search and sub-document ranking might be related enough to include in this cycle.

https://jira.mongodb.org/browse/SERVER-9652

Thanks!
