Core Server / SERVER-17620

RLP Tokenizer (includes C++ unit tests)

    • Type: Task
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 3.1.2
    • Affects Version/s: None
    • Component/s: Text Search
    • Labels:
      None
    • Fully Compatible
    • Platform 2 04/24/15

      Implement a derived class of FTSTokenizer that uses the Basis Tech Rosette Linguistics API to tokenize documents.
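
      As a rough illustration of the shape such a derived class could take, here is a minimal sketch. Everything in it is an assumption: TokenizerSketch is a hypothetical stand-in for FTSTokenizer (the real interface in the server source may differ), and a whitespace split stands in for the calls into the Rosette library, which are not reproduced here.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for the FTSTokenizer base class; the real interface
// in the server source may differ.
class TokenizerSketch {
public:
    virtual ~TokenizerSketch() = default;
    virtual void reset(std::string document) = 0;  // begin tokenizing a new document
    virtual bool moveNext() = 0;                   // advance; returns false when exhausted
    virtual const std::string& get() const = 0;    // token found by the last moveNext()
};

// Derived class. An RLP-backed version would hand the document to the
// Rosette library in reset() and pull tokens from it in moveNext(); here a
// plain whitespace split stands in for that step (assumption).
class RlpTokenizerSketch final : public TokenizerSketch {
public:
    void reset(std::string document) override {
        _doc = std::move(document);
        _pos = 0;
        _token.clear();
    }

    bool moveNext() override {
        // Skip leading whitespace, then capture the next run of non-whitespace.
        while (_pos < _doc.size() && isSpace(_doc[_pos])) ++_pos;
        if (_pos == _doc.size()) return false;
        const std::size_t start = _pos;
        while (_pos < _doc.size() && !isSpace(_doc[_pos])) ++_pos;
        _token = _doc.substr(start, _pos - start);
        return true;
    }

    const std::string& get() const override { return _token; }

private:
    static bool isSpace(char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    }

    std::string _doc;
    std::size_t _pos = 0;
    std::string _token;
};
```

      A caller would invoke reset() once per document and then loop on moveNext(), reading each token with get(). Keeping the iteration behind a small virtual interface is what would let a C++ unit test exercise the derived class without the rest of the text search pipeline.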

      Below is the table of languages we will add, along with their official language identifiers from the relevant ISO standards. We will use ISO-639-3 codes for these new languages because ISO-639-1 identifiers are two letters and cannot discriminate between languages in certain language families (e.g., Dari and Farsi both map to fa).

      For Chinese, Simplified and Traditional are not language dialects but script dialects, so we use a combination of the RLP names (e.g., zhs for Simplified Chinese) and the official ISO 15924 names (e.g., Hant; note the identifier is title-cased in the ISO spec).

      ISO Definitions:

      • ISO-639-1 - Two Letter Codes - Codes for the representation of names of languages
      • ISO-639-3 - Three Letter Codes - Codes for the representation of names of languages
      • ISO 15924 - Codes for the representation of names of scripts

      Language            | ISO-639-1 | ISO-639-3 | RLP | MongoDB                      | RLP Language Code
      Arabic              | ar        | ara       | ara | ara,arabic                   | BT_LANGUAGE_ARABIC
      Dari                | fa        | prs       | prs | prs,dari                     | BT_LANGUAGE_DARI
      Farsi (Persian)     | fa        | pes       | pes | pes,iranian persian          | BT_LANGUAGE_WESTERN_FARSI
      Urdu                | ur        | urd       | urd | urd,urdu                     | BT_LANGUAGE_URDU
      Simplified Chinese  | N/A       | N/A       | zhs | zhs,hans,simplified chinese  | BT_LANGUAGE_SIMPLIFIED_CHINESE
      Traditional Chinese | N/A       | N/A       | zht | zht,hant,traditional chinese | BT_LANGUAGE_TRADITIONAL_CHINESE
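
      The MongoDB column of the table can be read as a set of aliases per language. A minimal sketch of such a lookup follows; the RlpLanguage enum and lookupLanguage function are hypothetical names chosen for illustration (they are not the RLP constants or server code), and the case-insensitive matching is an assumption rather than something this ticket specifies.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// Hypothetical enum mirroring the rows of the table above; these are not the
// actual BT_LANGUAGE_* constants from the RLP headers.
enum class RlpLanguage {
    Arabic,
    Dari,
    WesternFarsi,
    Urdu,
    SimplifiedChinese,
    TraditionalChinese,
    Unknown
};

// Maps a user-supplied language name to a table row. Matching is
// case-insensitive (assumption), so "Hant" and "hant" resolve the same way.
RlpLanguage lookupLanguage(std::string name) {
    static const std::map<std::string, RlpLanguage> aliases = {
        {"ara", RlpLanguage::Arabic},
        {"arabic", RlpLanguage::Arabic},
        {"prs", RlpLanguage::Dari},
        {"dari", RlpLanguage::Dari},
        {"pes", RlpLanguage::WesternFarsi},
        {"iranian persian", RlpLanguage::WesternFarsi},
        {"urd", RlpLanguage::Urdu},
        {"urdu", RlpLanguage::Urdu},
        {"zhs", RlpLanguage::SimplifiedChinese},
        {"hans", RlpLanguage::SimplifiedChinese},
        {"simplified chinese", RlpLanguage::SimplifiedChinese},
        {"zht", RlpLanguage::TraditionalChinese},
        {"hant", RlpLanguage::TraditionalChinese},
        {"traditional chinese", RlpLanguage::TraditionalChinese},
    };

    // Lower-case the input before the lookup.
    std::transform(name.begin(), name.end(), name.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });

    const auto it = aliases.find(name);
    return it == aliases.end() ? RlpLanguage::Unknown : it->second;
}
```

      In this sketch the two-letter code fa is deliberately absent from the alias table, since it cannot distinguish Dari from Farsi; a lookup for it returns Unknown.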

            Assignee:
            mark.benvenuto@mongodb.com Mark Benvenuto
            Reporter:
            mark.benvenuto@mongodb.com Mark Benvenuto
            Votes:
            0
            Watchers:
            3
