[SERVER-25682] Relax collation locale string validation Created: 18/Aug/16 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | 3.3.11 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Derick Rethans | Assignee: | Backlog - Query Execution |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | collation, query-44-grooming | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Assigned Teams: |
Query Execution
|
||||||||
| Participants: | |||||||||
| Description |
|
I started playing around with MongoDB 3.3.11's locale support and ran into a few things that I had not expected. In most places, you specify a locale with an ICU locale string of the form "language_COUNTRYCODE". The ICU documentation at http://userguide.icu-project.org/locale is full of such examples. Therefore, I had expected all of the following to work:
As you can see, the only language_COUNTRYCODE combination that worked was "fr_CA". Sometimes it recommended an alternative ("fr_FR" -> "fr", "nn_NO" -> "nn"), although fr_FR and nn_NO should IMO also have been accepted. Additionally, is the locale "nl" not supported at all? Dutch has several interesting sorting issues revolving around "ij" (it sorts between "i" and "j"): http://demo.icu-project.org/icu-bin/locexp?_=nl |
| Comments |
| Comment by David Storch [ 18/Aug/16 ] | ||||||
|
As reflected in the new title, let's make this ticket specifically about the overly rigid locale id validation (i.e. requiring "fr" instead of "fr_FR"). Adding support for new locales should be handled separately, since the engineering work required from our team is entirely separate. | ||||||
| Comment by David Storch [ 18/Aug/16 ] | ||||||
|
Recent 3.3.x versions earlier than 3.3.13 do not support "nl" or "nl_NL" since we did not package the data necessary for this collator into the server binary. The list of supported locales is available here: Once our documentation for this feature becomes public, we will have this list available in the MongoDB 3.4 manual. | ||||||
| Comment by Derick Rethans [ 18/Aug/16 ] | ||||||
I would agree with that. It's very common for people to always use the language_COUNTRY variant, and I do not believe "fr_FR" is different from "fr" in this case. However, I also pointed out:
"nl" (or "nl_NL") is a valid locale according to ICU: its locale browser happily shows language-specific information for it (like the character "set" and letter ordering). | ||||||
| Comment by David Storch [ 18/Aug/16 ] | ||||||
|
This is an artifact of how we currently validate the locale string. We pass the locale string through to ICU, and if it falls back on a different locale string, then we throw an "invalid locale" error. This is to make sure that you don't pass garbage. ICU is perfectly happy to use the French locale for "fr_GARBAGE". In the case of "fr_FR", I believe ICU falls back to just "fr", which is why we make the "fr_FR" => "fr" suggestion. What I'm not sure of is whether the "fr_FR" collator is behaviorally different than the "fr" collator, and therefore unsupported, whether we should accept "fr_FR" as an alternative spelling for "fr", or whether we should reject "fr_FR" because it's a non-normalized spelling of "fr". I suspect that what we want is the second option (accept "fr_FR" as an alternative spelling for "fr"). |
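
The validation described above can be illustrated with a minimal sketch. It does not call ICU; the fallback is simulated by stripping trailing subtags, and `SUPPORTED_LOCALES`, `KNOWN_REGIONS`, and all function names are hypothetical stand-ins rather than anything from the server code. `validate_strict` mirrors the current behavior (any fallback is an error, with the fallback offered as a suggestion), while `validate_relaxed` shows one possible reading of the second option, where "fr_FR" is accepted as an alternative spelling of "fr" but "fr_GARBAGE" is still rejected.

```python
# Hypothetical stand-in for the locales whose collation data is packaged
# into the server binary; the real list is not reproduced here.
SUPPORTED_LOCALES = {"fr", "fr_CA", "nn", "en", "en_US"}

# Hypothetical stand-in for a table of valid region codes.
KNOWN_REGIONS = {"FR", "CA", "NO", "NL", "US"}


def resolve_with_fallback(requested):
    """Simulate locale fallback: drop trailing subtags until a supported locale matches."""
    candidate = requested
    while candidate:
        if candidate in SUPPORTED_LOCALES:
            return candidate
        candidate = candidate.rpartition("_")[0]
    return None  # nothing matched, not even the bare language


def invalid_locale(requested, resolved):
    """Build the error, including a suggestion when a fallback locale exists."""
    hint = f' (did you mean "{resolved}"?)' if resolved else ""
    return ValueError(f'invalid locale "{requested}"{hint}')


def validate_strict(requested):
    """Strict check as described in the comment: any fallback is an error."""
    resolved = resolve_with_fallback(requested)
    if resolved != requested:
        raise invalid_locale(requested, resolved)
    return resolved


def validate_relaxed(requested):
    """Assumed relaxed check: accept language_REGION when only a known region was dropped."""
    resolved = resolve_with_fallback(requested)
    if resolved == requested:
        return resolved
    language, sep, region = requested.partition("_")
    if sep and resolved == language and region in KNOWN_REGIONS:
        return resolved  # treat e.g. "fr_FR" as another spelling of "fr"
    raise invalid_locale(requested, resolved)
```

Under these assumptions, `validate_strict("fr_FR")` raises an error that suggests "fr", whereas `validate_relaxed("fr_FR")` returns "fr". Both functions reject "fr_GARBAGE" and, because no Dutch data is in the hypothetical supported set, "nl_NL".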