Documentation / DOCS-10976

Clarify diacritic insensitivity for text indexes

      Currently, the documentation for text indexes has this to say about diacritic insensitivity:

      With version 3, text index is diacritic insensitive. That is, the index does not distinguish between characters that contain diacritical marks and their non-marked counterpart, such as é, ê, and e. More specifically, the text index strips the characters categorized as diacritics in Unicode 8.0 Character Database Prop List.

      This is mostly correct, but our algorithm misses some cases because of how the Unicode standard defines decompositions. The algorithm takes a Unicode code point (say, Ç), applies a standard decomposition algorithm to break it down (in this case, into C + ◌̧), and then removes the diacritic characters (here, the combining cedilla). However, some letters, such as Ł and Ø, decompose only to themselves under the standard decomposition algorithm, so they are not transformed into L and O, respectively. See the section "Non-decomposition of Certain Diacritics" in section 2.12 of the current Unicode standard.
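
      To illustrate the behavior described above, here is a minimal sketch in Python. It is not the server's implementation: it assumes canonical decomposition (NFD) from Python's standard unicodedata module as the "standard decomposition algorithm", and it approximates "diacritic characters" as any character with a nonzero canonical combining class, rather than using the Diacritic property from the Unicode Prop List. The function name strip_diacritics is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose each character, then drop combining marks.

    Assumption: NFD stands in for the decomposition step, and a nonzero
    canonical combining class stands in for "is a diacritic".
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Ç has a canonical decomposition (C + combining cedilla), so the mark is removed.
print(unicodedata.decomposition("Ç"))  # 0043 0327
print(strip_diacritics("Ç"))           # C

# Ł and Ø have no decomposition, so they pass through unchanged.
print(repr(unicodedata.decomposition("Ł")))  # ''
print(strip_diacritics("Ł"))                 # Ł
print(strip_diacritics("Ø"))                 # Ø
```

      Under these assumptions, é and ê both reduce to e, while Ł and Ø are left as they are, which matches the gap this ticket describes.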

      I think it would be worth clarifying how this diacritic stripping is performed to avoid confusion; see SERVER-31152.

            Assignee: Unassigned
            Reporter: Kyle Suarez (kyle.suarez@mongodb.com)
            Votes: 0
            Watchers: 5

              Created:
              Updated:
              1 year, 25 weeks, 2 days ago