[SERVER-9390] Multi-language support for text search Created: 18/Apr/13 Updated: 07/Feb/14 Resolved: 12/Oct/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Text Search |
| Affects Version/s: | 2.4.0 |
| Fix Version/s: | 2.5.3 |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Paul Pedersen | Assignee: | J Rassi |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Participants: |
| Description |
|
Text Search - Multi-Language Documents

Objective

The objective is for FTS to support documents containing multiple languages. This means that the correct stemmers and stopword lists are applied according to the language specification on a per-subdocument basis.

External Specification

We allow documents to contain nested "language" specifiers. The same override field name applies at every level in a document; that is, you can use a name other than "language", but you have to use the same name everywhere. The format is minimal:

    {
        ...
        language: "spanish",
        ...
        subdoc: {
            language: "portuguese",
            ...
        },
        ...
    }

Implementation

All the work of generating scored terms for the FTS index is concentrated in the method FTSSpec::scoreDocument. The current algorithm determines the language binding from the value of the field "language" (or the override) in the given document:

    BSONElement e = userDoc[_languageOverrideField];
    => e.valuestrsafe()

Then the algorithm loops over the index field specifiers and extracts language-specific (term, score) pairs for the FTS indexer plugin.

In order to make this work with subdocument language specifiers, we need to allow every field in the weights vector (index field specifiers) to have a possibly different associated language, as determined by the given document. In the loop (in FTSSpec::scoreDocument):

    Weights::const_iterator i;
    for ( i = _weights.begin(); i != _weights.end(); i++ ) {
        // name of field
        const char * leftOverName = i->first.c_str();
        BSONElement e = obj.getFieldDottedOrArray(leftOverName);
        ...

we could check for a local language specifier in the element e, working outward through progressively shorter prefixes of the dotted name. If the name is not dotted, no language change occurs.

Alternatively, we could rewrite the loop and descend recursively into the given document, stacking "language" field values and checking index field names against a quick lookup table.
Such a reimplementation would be efficient, since in the worst case it requires exactly one pass through the given document. The implementation of operator[](const char* field) in BSONObj iterates across the BSON object and returns the first matching field. Model the m text fields as randomly located in the document, with the i-th field at iteration position f_i. The existing implementation of FTSSpec::scoreDocument then requires sum_{i=1..m} f_i steps. A single traversal of the document checking for text fields would require m + max(f_i) steps (assuming that checking membership in the text field set is O(1), and that creating the hashtable takes m steps). |
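The recursive alternative described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the server code: Doc, Value, LanguageByField, collectLanguages, and exampleLanguages are hypothetical stand-ins for the BSON types and the FTSSpec internals. Each level is scanned twice here so that a language specifier appearing after a text field still applies to it, which keeps the work linear in the size of the document.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a BSON document (hypothetical, not the real BSONObj):
// an ordered list of fields whose values are strings or nested documents.
struct Doc;
struct Value {
    std::string text;          // used when sub is null
    std::shared_ptr<Doc> sub;  // non-null for a nested document
};
struct Doc {
    std::vector<std::pair<std::string, Value>> fields;
};

Value str(const std::string& s) { return Value{s, nullptr}; }
Value sub(Doc d) { return Value{"", std::make_shared<Doc>(std::move(d))}; }

// Maps the dotted path of each indexed text field to its effective language.
using LanguageByField = std::map<std::string, std::string>;

// Descends recursively into doc, carrying the language inherited from the
// enclosing levels. textFields plays the role of the quick lookup table.
void collectLanguages(const Doc& doc,
                      const std::string& prefix,
                      const std::string& inherited,
                      const std::string& overrideField,
                      const std::set<std::string>& textFields,
                      LanguageByField& out) {
    assert(!overrideField.empty());

    // A specifier at this level overrides the inherited language.
    std::string language = inherited;
    for (const auto& f : doc.fields) {
        if (f.first == overrideField && !f.second.sub) {
            language = f.second.text;
        }
    }

    for (const auto& f : doc.fields) {
        const std::string path = prefix.empty() ? f.first : prefix + "." + f.first;
        if (f.second.sub) {
            // The recursion acts as the "stack" of language values.
            collectLanguages(*f.second.sub, path, language, overrideField,
                             textFields, out);
        } else if (textFields.count(path) > 0) {
            out[path] = language;
        }
    }
}

// Builds the example document from the External Specification section and
// resolves the language of two assumed text fields, "title" and "subdoc.body".
LanguageByField exampleLanguages() {
    Doc inner;
    inner.fields = {{"language", str("portuguese")}, {"body", str("um texto")}};
    Doc root;
    root.fields = {{"language", str("spanish")},
                   {"title", str("un texto")},
                   {"subdoc", sub(inner)}};

    const std::set<std::string> textFields = {"title", "subdoc.body"};
    LanguageByField out;
    collectLanguages(root, "", "english", "language", textFields, out);
    return out;
}
```

In this sketch the top-level "title" field resolves to "spanish" and "subdoc.body" resolves to "portuguese"; a subdocument without its own specifier inherits the language of its parent.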
| Comments |
| Comment by auto [ 12/Oct/13 ] |
|
Author: Jason Rassi (username: jrassi, email: rassi@10gen.com)
Message: |
| Comment by auto [ 12/Oct/13 ] |
|
Author: Jason Rassi (username: jrassi, email: rassi@10gen.com)
Message: FTSIndexFormat::getKeys() now descends into subdocuments to find |
| Comment by Shane R. Spencer [ 15/May/13 ] |
|
I'm looking forward to having the following scenario implemented to allow for some very multi-language oriented documents. I'm sure it's not a simple matter however.
Of course the following would be a test case as well:
|
| Comment by Paul Pedersen [ 11/May/13 ] |
|
thx. yes, i've implemented sub-document language support. i'll need to -p |
| Comment by Will Shaver [ 10/May/13 ] |
|
Paul - When at MongoDB Days SF, I discussed the need for sub-document scores with Eliot Horowitz. He said I should create a ticket, as you were working on multi-language support for full-text search and sub-document ranking might be related enough to include in this cycle. https://jira.mongodb.org/browse/SERVER-9652 Thanks! |