[SERVER-17620] RLP Tokenizer (includes C++ unit tests) Created: 16/Mar/15  Updated: 05/Dec/16  Resolved: 16/Apr/15

Status: Closed
Project: Core Server
Component/s: Text Search
Affects Version/s: None
Fix Version/s: 3.1.2

Type: Task Priority: Major - P3
Reporter: Mark Benvenuto Assignee: Mark Benvenuto
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-13709 Add text index support for arabic Closed
is depended on by SERVER-17595 Add support for Persian language in t... Closed
is depended on by SERVER-8962 increasing Chinese support of text index Closed
Documented
is documented by DOCS-9546 Docs for SERVER-17620: RLP Tokenizer ... Closed
Duplicate
is duplicated by SERVER-13709 Add text index support for arabic Closed
is duplicated by SERVER-17595 Add support for Persian language in t... Closed
Tested
Backwards Compatibility: Fully Compatible
Sprint: Platform 2 04/24/15
Participants:

 Description   

Implement a derived class of FTSTokenizer that uses the Basis Tech Rosette Linguistics API to tokenize documents.
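
The shape of such a derived class might be sketched as follows. This is a minimal illustration only: SketchTokenizer is a hypothetical stand-in interface, not the actual FTSTokenizer declaration, and the whitespace splitter stands in for the calls into the Rosette Linguistics API, which are not shown in this ticket.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a tokenizer base class (NOT the real FTSTokenizer API).
class SketchTokenizer {
public:
    virtual ~SketchTokenizer() = default;
    // Start tokenizing a new document. The caller keeps the buffer alive.
    virtual void reset(std::string_view document) = 0;
    // Advance to the next token; returns false when the document is exhausted.
    virtual bool moveNext() = 0;
    // Current token; only meaningful after moveNext() returned true.
    virtual std::string_view get() const = 0;
};

// Assumption: a whitespace splitter used where an RLP-backed class would
// delegate to the linguistics library.
class WhitespaceSketchTokenizer final : public SketchTokenizer {
public:
    void reset(std::string_view document) override {
        _document = document;
        _token = {};
        _pos = 0;
    }

    bool moveNext() override {
        // Skip separators before the next token.
        while (_pos < _document.size() && isSpace(_document[_pos])) ++_pos;
        if (_pos == _document.size()) {
            _token = {};
            return false;
        }
        const std::size_t start = _pos;
        while (_pos < _document.size() && !isSpace(_document[_pos])) ++_pos;
        _token = _document.substr(start, _pos - start);
        return true;
    }

    std::string_view get() const override { return _token; }

private:
    static bool isSpace(char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    }

    std::string_view _document;
    std::string_view _token;
    std::size_t _pos = 0;
};

// Drain any tokenizer into a vector; this works through the base interface,
// so a library-backed subclass could be swapped in without changing callers.
std::vector<std::string> tokenizeAll(SketchTokenizer& tokenizer, std::string_view document) {
    std::vector<std::string> tokens;
    tokenizer.reset(document);
    while (tokenizer.moveNext()) tokens.emplace_back(tokenizer.get());
    return tokens;
}
```

Callers that only see the base interface, such as tokenizeAll above, do not need to know which tokenizer implementation handles a given language.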

The table below lists the languages we will add, along with their official identifiers from the relevant ISO standards. We will use ISO-639-3 codes for these new languages because ISO-639-1 identifiers are only two letters long and cannot distinguish between some closely related languages (e.g., Dari and Farsi both map to fa).

For Chinese, Simplified and Traditional are script variants rather than language dialects, so we use a combination of the RLP names (e.g., zhs for Simplified Chinese) and the official ISO 15924 script names (e.g., Hant; note that the identifier is title-cased in the ISO spec).

ISO Definitions:

  • ISO-639-1 - Two Letter Codes - Codes for the representation of names of languages
  • ISO-639-3 - Three Letter Codes - Codes for the representation of names of languages
  • ISO 15924 - Codes for the representation of names of scripts

Language             ISO-639-1  ISO-639-3  RLP  MongoDB                       RLP Language Code
Arabic               ar         ara        ara  ara,arabic                    BT_LANGUAGE_ARABIC
Dari                 fa         prs        prs  prs,dari                      BT_LANGUAGE_DARI
Farsi (Persian)      fa         pes        pes  pes,iranian persian           BT_LANGUAGE_WESTERN_FARSI
Urdu                 ur         urd        urd  urd,urdu                      BT_LANGUAGE_URDU
Simplified Chinese   N/A        N/A        zhs  zhs,hans,simplified chinese   BT_LANGUAGE_SIMPLIFIED_CHINESE
Traditional Chinese  N/A        N/A        zht  zht,hant,traditional chinese  BT_LANGUAGE_TRADITIONAL_CHINESE
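
The mapping in the table can be sketched as a simple alias lookup. This is an illustration under stated assumptions: RlpLanguage is a hypothetical enum standing in for the BT_LANGUAGE_* constants, lookupRlpLanguage is a made-up helper name, and the case-insensitive matching is an assumption rather than documented server behavior.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical stand-in for the BT_LANGUAGE_* constants listed in the table.
enum class RlpLanguage {
    Arabic,
    Dari,
    WesternFarsi,
    Urdu,
    SimplifiedChinese,
    TraditionalChinese,
};

// Resolve a MongoDB language name from the table to its RLP language.
// Returns std::nullopt for names the table does not list.
std::optional<RlpLanguage> lookupRlpLanguage(std::string_view name) {
    static const std::unordered_map<std::string, RlpLanguage> kAliases = {
        {"ara", RlpLanguage::Arabic},
        {"arabic", RlpLanguage::Arabic},
        {"prs", RlpLanguage::Dari},
        {"dari", RlpLanguage::Dari},
        {"pes", RlpLanguage::WesternFarsi},
        {"iranian persian", RlpLanguage::WesternFarsi},
        {"urd", RlpLanguage::Urdu},
        {"urdu", RlpLanguage::Urdu},
        {"zhs", RlpLanguage::SimplifiedChinese},
        {"hans", RlpLanguage::SimplifiedChinese},
        {"simplified chinese", RlpLanguage::SimplifiedChinese},
        {"zht", RlpLanguage::TraditionalChinese},
        {"hant", RlpLanguage::TraditionalChinese},
        {"traditional chinese", RlpLanguage::TraditionalChinese},
    };

    // Assumption: names are matched case-insensitively, so the title-cased
    // ISO 15924 form "Hant" resolves the same way as "hant".
    std::string key(name);
    std::transform(key.begin(), key.end(), key.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });

    const auto it = kAliases.find(key);
    if (it == kAliases.end()) return std::nullopt;
    return it->second;
}
```

Note that the two-letter code fa is deliberately absent from the alias map, since it cannot distinguish Dari from Farsi.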


 Comments   
Comment by Githook User [ 16/Apr/15 ]

Author: Mark Benvenuto (markbenvenuto) <mark.benvenuto@mongodb.com>

Message: SERVER-17620 RLP Language Support
Branch: master
https://github.com/10gen/mongo-enterprise-modules/commit/caefbd3def87f28671f9be009f19c32f7f11aa40

Comment by Githook User [ 16/Apr/15 ]

Author: Mark Benvenuto (markbenvenuto) <mark.benvenuto@mongodb.com>

Message: SERVER-17620 RLP Tokenizer
Branch: master
https://github.com/10gen/mongo-enterprise-modules/commit/ec09840f55f320be7f1b7daf412bb3c402158796

Comment by Githook User [ 16/Apr/15 ]

Author: Mark Benvenuto (markbenvenuto) <mark.benvenuto@mongodb.com>

Message: SERVER-17620 RLP Tokenizer
Branch: master
https://github.com/mongodb/mongo/commit/b0b78e017d4e503fb347962228000f46adad5b39
