[DOCS-9546] Docs for SERVER-17620: RLP Tokenizer (includes C++ unit tests) Created: 05/Dec/16  Updated: 21/Jan/18  Resolved: 21/Jan/18

Status: Closed
Project: Documentation
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Emily Hall Assignee: Kay Kim (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
documents SERVER-17620 RLP Tokenizer (includes C++ unit tests) Closed
Participants:
Days since reply: 6 years, 3 weeks, 4 days ago
Epic Link: PM-37

 Description   

Engineering Ticket Description:

Implement a derived class of FTSTokenizer that uses the Basis Tech Rosette Linguistics API to tokenize documents.
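
As a rough illustration of the "derived class" shape described above, the sketch below uses a simplified, hypothetical base interface (TokenizerSketch) rather than the real FTSTokenizer, and a whitespace splitter in place of any Rosette call. The class names, method names, and splitting rule are all assumptions made for illustration; none of them come from SERVER-17620 or the Rosette SDK.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical, simplified stand-in for the FTSTokenizer base class.
class TokenizerSketch {
public:
    virtual ~TokenizerSketch() = default;
    virtual void reset(std::string document) = 0;  // begin tokenizing a new document
    virtual bool moveNext() = 0;                   // advance; returns false when exhausted
    virtual std::string_view get() const = 0;      // token found by the last moveNext()
};

// Derived class. An RLP-backed tokenizer would hand the document to the
// Rosette API at this point; this placeholder only splits on whitespace.
class WhitespaceTokenizerSketch final : public TokenizerSketch {
public:
    void reset(std::string document) override {
        _document = std::move(document);
        _pos = 0;
        _token = {};
    }

    bool moveNext() override {
        // Skip any whitespace before the next token.
        while (_pos < _document.size() && isSpace(_document[_pos])) ++_pos;
        if (_pos == _document.size()) {
            _token = {};
            return false;
        }
        // Consume characters up to the next whitespace or end of input.
        const std::size_t start = _pos;
        while (_pos < _document.size() && !isSpace(_document[_pos])) ++_pos;
        _token = std::string_view(_document).substr(start, _pos - start);
        return true;
    }

    std::string_view get() const override { return _token; }

private:
    static bool isSpace(char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    }

    std::string _document;    // owned copy, so returned views stay valid until reset()
    std::size_t _pos = 0;     // scan position within _document
    std::string_view _token;  // view of the current token inside _document
};
```

A caller would call reset() with a document and then loop on moveNext(), reading each token through get().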

Below is the table of languages we will add. The table includes the official ISO language identifiers from the relevant ISO standards. We will use ISO-639-3 codes for these new languages because ISO-639-1 identifiers are two letters long and cannot discriminate between languages in certain language families (e.g., Dari and Farsi both map to fa).

For Chinese, Simplified and Traditional are not language dialects but script dialects, so we use a combination of the RLP names (e.g., zhs for Simplified Chinese) and the official ISO 15924 names (e.g., Hant; note that the identifier is title-cased in the ISO spec).

ISO Definitions:

  • ISO-639-1 - Two Letter Codes - Codes for the representation of names of languages
  • ISO-639-3 - Three Letter Codes - Codes for the representation of names of languages
  • ISO 15924 - Codes for the representation of names of scripts

Language             ISO-639-1  ISO-639-3  RLP  MongoDB                       RLP Language Code
Arabic               ar         ara        ara  ara,arabic                    BT_LANGUAGE_ARABIC
Dari                 fa         prs        prs  prs,dari                      BT_LANGUAGE_DARI
Farsi (Persian)      fa         pes        pes  pes,iranian persian           BT_LANGUAGE_WESTERN_FARSI
Urdu                 ur         urd        urd  urd,urdu                      BT_LANGUAGE_URDU
Simplified Chinese   N/A        N/A        zhs  zhs,hans,simplified chinese   BT_LANGUAGE_SIMPLIFIED_CHINESE
Traditional Chinese  N/A        N/A        zht  zht,hant,traditional chinese  BT_LANGUAGE_TRADITIONAL_CHINESE
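
The MongoDB column of the table can be read as a set of aliases that each resolve to one RLP language code. A minimal sketch of such a lookup follows. The function name rlpLanguageCodeFor is hypothetical, the codes are kept as plain strings because the Rosette SDK headers are not assumed to be available, and the case-insensitive matching is an assumption rather than something the ticket specifies.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical lookup from a MongoDB language name (table column "MongoDB")
// to the name of the RLP language code (table column "RLP Language Code").
// Returns std::nullopt for names that are not in the table.
inline std::optional<std::string> rlpLanguageCodeFor(std::string_view mongoName) {
    static const std::unordered_map<std::string, std::string> kAliases = {
        {"ara", "BT_LANGUAGE_ARABIC"},
        {"arabic", "BT_LANGUAGE_ARABIC"},
        {"prs", "BT_LANGUAGE_DARI"},
        {"dari", "BT_LANGUAGE_DARI"},
        {"pes", "BT_LANGUAGE_WESTERN_FARSI"},
        {"iranian persian", "BT_LANGUAGE_WESTERN_FARSI"},
        {"urd", "BT_LANGUAGE_URDU"},
        {"urdu", "BT_LANGUAGE_URDU"},
        {"zhs", "BT_LANGUAGE_SIMPLIFIED_CHINESE"},
        {"hans", "BT_LANGUAGE_SIMPLIFIED_CHINESE"},
        {"simplified chinese", "BT_LANGUAGE_SIMPLIFIED_CHINESE"},
        {"zht", "BT_LANGUAGE_TRADITIONAL_CHINESE"},
        {"hant", "BT_LANGUAGE_TRADITIONAL_CHINESE"},
        {"traditional chinese", "BT_LANGUAGE_TRADITIONAL_CHINESE"},
    };

    // Assumption: matching ignores ASCII case, so "Hant" and "hant" both resolve.
    std::string key(mongoName);
    std::transform(key.begin(), key.end(), key.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });

    const auto it = kAliases.find(key);
    if (it == kAliases.end()) {
        return std::nullopt;
    }
    return it->second;
}
```

The two-letter code fa is deliberately absent from the map: it is shared by Dari and Farsi in ISO-639-1, so a lookup on it returns no value instead of guessing.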


 Comments   
Comment by Kay Kim (Inactive) [ 21/Jan/18 ]

Was already done a year before this ticket was created.

Generated at Thu Feb 08 07:58:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.