[DOCS-10976] Clarify diacritic insensitivity for text indexes Created: 02/Nov/17  Updated: 30/Oct/23

Status: Closed
Project: Documentation
Component/s: manual, Server
Affects Version/s: None
Fix Version/s: Server_Docs_20231030

Type: Improvement Priority: Major - P3
Reporter: Kyle Suarez Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-31152 Wrong diacritics for polish letter ł Closed
Participants:
Days since reply: 1 year, 14 weeks, 2 days ago
Epic Link: DOCSP-1769
Story Points: 0.4

 Description   

Currently, the documentation for text indexes has this to say about diacritic insensitivity:

"With version 3, text index is diacritic insensitive. That is, the index does not distinguish between characters that contain diacritical marks and their non-marked counterpart, such as é, ê, and e. More specifically, the text index strips the characters categorized as diacritics in Unicode 8.0 Character Database Prop List."

This is mostly correct, but our algorithm "misses" some cases because of some oddities in the Unicode standard. Our algorithm takes a Unicode code point (say Ç), uses a standard decomposition algorithm to break it down (in this case, C + ◌̧), and then removes the diacritic characters (here, the combining cedilla). However, letters like Ł and Ø simply decompose to themselves under the standard decomposition algorithm, so they are not transformed into L and O, respectively. See "Non-decomposition of Certain Diacritics" in section 2.12 of the current Unicode standard.
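
The decompose-then-strip behavior described above can be approximated with a minimal Python sketch. This is not the server's implementation: it assumes NFD (canonical decomposition) as the "standard decomposition algorithm" and treats any character with a non-zero combining class as a diacritic, which is only an approximation of the Diacritic property in the Unicode Prop List. The function name strip_diacritics is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Approximate diacritic stripping: decompose, then drop combining marks.

    Assumption: NFD stands in for the server's decomposition step, and
    "combining class != 0" stands in for the Unicode Diacritic property.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for ch in ["Ç", "é", "ê", "Ł", "Ø"]:
        # An empty decomposition string means the character maps to itself.
        mapping = unicodedata.decomposition(ch) or "(none)"
        print(f"{ch!r}: decomposition={mapping}, stripped={strip_diacritics(ch)!r}")
```

In this sketch Ç, é, and ê lose their marks because they decompose into a base letter plus a combining character, while Ł and Ø come back unchanged because Unicode defines no decomposition for them.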

I think it would be worth clarifying how this diacritic stripping is performed to avoid confusion; see SERVER-31152.



 Comments   
Comment by Kyle Suarez [ 31/Oct/22 ]

I did check the latest Unicode standard as of today (apparently we are up to 15 now) and the text in section 2.12: "Non-decomposition of Certain Diacritics" is still the same, so at least we know the behavior is stable.

Comment by Sarah Olson [ 31/Oct/22 ]

Thanks for the confirm, kyle.suarez@mongodb.com!

Comment by Kyle Suarez [ 31/Oct/22 ]

When I opened this five years ago I thought clarifying this would be important; now I think it's probably too much in the weeds? Probably best this stays closed unless some customer complains about it.

Comment by Education Bot [ 31/Oct/22 ]

Hello! This ticket has been closed due to inactivity. If you believe this ticket is still important, please reopen it and leave a comment to explain why. Thank you!

Generated at Thu Feb 08 08:01:47 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.