[SERVER-25682] Relax collation locale string validation Created: 18/Aug/16  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Querying
Affects Version/s: 3.3.11
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Derick Rethans Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 0
Labels: collation, query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-26072 Add collation support for additional ... Closed
Assigned Teams:
Query Execution
Participants:

 Description   

I started playing around with MongoDB 3.3.11's locale support and ran into a few things that I had not expected. In most places, you specify a locale with an ICU locale string in the form "language_COUNTRYCODE". The ICU documentation at (http://userguide.icu-project.org/locale) is full of such examples. Therefore, I had expected all of the following to work:

> db.test.createIndex( { a: 1 }, { collation: { locale: 'fr_FR', caseLevel: true, strength: 4 } } );
{
	"ok" : 0,
	"errmsg" : "Field 'locale' is invalid in: { locale: \"fr_FR\", caseLevel: true, strength: 4.0 }. Did you mean 'fr'?",
	"code" : 2
}
> db.test.createIndex( { a: 1 }, { collation: { locale: 'fr_CA', caseLevel: true, strength: 4 } } );
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 1,
	"numIndexesAfter" : 2,
	"ok" : 1
}
> db.test.createIndex( { a: 1 }, { collation: { locale: 'nl_BE', caseLevel: true, strength: 4 } } );
{
	"ok" : 0,
	"errmsg" : "Field 'locale' is invalid in: { locale: \"nl_BE\", caseLevel: true, strength: 4.0 }",
	"code" : 2
}
> db.test.createIndex( { a: 1 }, { collation: { locale: 'nl_NL', caseLevel: true, strength: 4 } } );
{
	"ok" : 0,
	"errmsg" : "Field 'locale' is invalid in: { locale: \"nl_NL\", caseLevel: true, strength: 4.0 }",
	"code" : 2
}
> db.test.createIndex( { a: 1 }, { collation: { locale: 'nn_NO', caseLevel: true, strength: 4 } } );
{
	"ok" : 0,
	"errmsg" : "Field 'locale' is invalid in: { locale: \"nn_NO\", caseLevel: true, strength: 4.0 }. Did you mean 'nn'?",
	"code" : 2
}
> db.test.createIndex( { a: 1 }, { collation: { locale: 'nb_NO', caseLevel: true, strength: 4 } } );
{
	"ok" : 0,
	"errmsg" : "Field 'locale' is invalid in: { locale: \"nb_NO\", caseLevel: true, strength: 4.0 }",
	"code" : 2
}
> db.test.createIndex( { a: 1 }, { name: 'a_nl_simple', collation: { locale: 'nl' } } );
{
	"ok" : 0,
	"errmsg" : "Field 'locale' is invalid in: { locale: \"nl\" }",
	"code" : 2
}

As you can see, the only language_COUNTRYCODE combination that worked was "fr_CA". Sometimes the error recommended an alternative ("fr_FR" -> "fr", "nn_NO" -> "nn"), although fr_FR and nn_NO should IMO also have been accepted.

Additionally, is the locale "nl" not supported at all? Dutch has several interesting sorting issues revolving around "ij" (it sorts between "i" and "j"): http://demo.icu-project.org/icu-bin/locexp?_=nl



 Comments   
Comment by David Storch [ 18/Aug/16 ]

As reflected in the new title, let's make this ticket specifically about the overly rigid locale id validation (i.e. requiring "fr" instead of "fr_FR"). Adding support for new locales should be handled separately, since the engineering work required from our team is entirely separate.

Comment by David Storch [ 18/Aug/16 ]

Recent 3.3.x versions earlier than 3.3.13 do not support "nl" or "nl_NL" since we did not package the data necessary for this collator into the server binary. The list of supported locales is available here:

https://github.com/mongodb/mongo/blob/master/src/third_party/icu4c-57.1/source/mongo_sources/languages.txt

Once our documentation for this feature becomes public, we will have this list available in the MongoDB 3.4 manual.

Comment by Derick Rethans [ 18/Aug/16 ]

David wrote: "I suspect that what we want is the second option (accept "fr_FR" as an alternative spelling for "fr")."

I would agree with that. It's very common for people to always use the language_COUNTRY variant. I do not believe "fr_FR" behaves any differently from "fr" in this case, though.

However, I also pointed out:

> db.test.createIndex( { a: 1 }, { name: 'a_nl_simple', collation: { locale: 'nl' } } );
{
	"ok" : 0,
	"errmsg" : "Field 'locale' is invalid in: { locale: \"nl\" }",
	"code" : 2
}

"nl" (or "nl_NL") is a valid locale according to ICU. Its locale browser happily shows language-specific information for it (like the character "set" and letter ordering).

Comment by David Storch [ 18/Aug/16 ]

This is an artifact of how we currently validate the locale string. We pass the locale string through to ICU, and if ICU falls back on a different locale string, then we throw an "invalid locale" error. This is to make sure that you don't pass garbage: ICU is perfectly happy to use the French locale for "fr_GARBAGE". In the case of "fr_FR", I believe ICU falls back to just "fr", which is why we make the "fr_FR" => "fr" suggestion.

What I'm not sure of is which of three things is true: (1) the "fr_FR" collator is behaviorally different from the "fr" collator, and is therefore unsupported; (2) we should accept "fr_FR" as an alternative spelling for "fr"; or (3) we should reject "fr_FR" because it's a non-normalized spelling of "fr". I suspect that what we want is the second option (accept "fr_FR" as an alternative spelling for "fr").
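
To make the validation behavior described in this comment concrete, here is a rough sketch in plain JavaScript. It is not the server's implementation. The PACKAGED set, the ALIASES table, and the function names are all hypothetical, the fallback is a simplified imitation of ICU's (drop trailing "_SUBTAG" parts until something matches), and the error text is abbreviated. It only illustrates why "fr_FR" is rejected with a suggestion while "nl_BE" is rejected without one, and what the "alternative spelling" option might look like.

```javascript
// Hypothetical subset of locales whose collation data is packaged (illustration only).
const PACKAGED = new Set(["fr", "fr_CA", "nn"]);

// Hypothetical table of spellings to accept as aliases under the relaxed option.
const ALIASES = new Map([["fr_FR", "fr"], ["nn_NO", "nn"]]);

// Simplified imitation of ICU fallback: strip trailing "_SUBTAG" components
// until a packaged locale matches; give up with "root" if nothing matches.
function resolveLocale(requested) {
  let candidate = requested;
  while (candidate.length > 0) {
    if (PACKAGED.has(candidate)) return candidate;
    const cut = candidate.lastIndexOf("_");
    candidate = cut === -1 ? "" : candidate.slice(0, cut);
  }
  return "root";
}

// Strict validation as described above: any fallback at all is an error.
function validateStrict(requested) {
  const resolved = resolveLocale(requested);
  if (resolved === requested) return { ok: 1, locale: resolved };
  const hint = resolved === "root" ? "" : ` Did you mean '${resolved}'?`;
  return { ok: 0, errmsg: `Field 'locale' is invalid: "${requested}".${hint}` };
}

// Relaxed validation: accept known alternative spellings, otherwise stay strict
// so that garbage such as "fr_GARBAGE" is still rejected.
function validateRelaxed(requested) {
  if (ALIASES.has(requested)) return { ok: 1, locale: ALIASES.get(requested) };
  return validateStrict(requested);
}

console.log(validateStrict("fr_FR"));
console.log(validateStrict("nl_BE"));
console.log(validateRelaxed("fr_FR"));
console.log(validateRelaxed("fr_GARBAGE"));
```

The explicit alias table is one possible design choice. It keeps the "don't pass garbage" guarantee, at the cost of maintaining a list of accepted spellings.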

Generated at Thu Feb 08 04:09:53 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.