[SERVER-25683] Collation support does not reject invalid combinations of fields Created: 18/Aug/16  Updated: 19/Nov/16  Resolved: 30/Aug/16

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 3.3.11
Fix Version/s: 3.3.14

Type: Bug Priority: Major - P3
Reporter: Derick Rethans Assignee: David Storch
Resolution: Done Votes: 0
Labels: collation
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Query 2016-09-19
Participants:

 Description   

When creation an index with a locale, the server rejects things like this:

> db.test.createIndex( { a: 1 }, { name: 'en_complex', collation: { locale: 'en', strength: 8 } } );
{
	"ok" : 0,
	"errmsg" : "Field 'strength' must be an integer 1 through 5. Got: 8",
	"code" : 9
}

But it does not seem to reject the following:

> db.test.createIndex( { a: 1 }, { name: 'en_complex_1', collation: { locale: 'en', strength: 3, caseLevel: true } } );
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 3,
	"numIndexesAfter" : 4,
	"ok" : 1
}

A caseLevel definition only make sense for strength 1 or 2, as per "Optional fields" in the spec, with other levels it would just get ignored.

When reviewing the drivers specification, david.golden had flagged this. The consensus on the drivers side seemed to be that we should warn users about this "ignored field", but that this should be the responsibility of the server to throw an exception for – just like we rely on the server to that for readConcerns: https://github.com/mongodb/specifications/blob/master/source/read-write-concern/read-write-concern.rst#unknown-levels-and-additional-options-for-string-based-readconcerns



 Comments   
Comment by Githook User [ 30/Aug/16 ]

Author:

{u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}

Message: SERVER-25683 reject invalid combinations of collation options
Branch: master
https://github.com/mongodb/mongo/commit/ed0f26286c8a267cb4111a12fdb6d956a75ddaab

Comment by David Storch [ 29/Aug/16 ]

david.golden, LGTM! We should definitely incorporate something like this as we continue to revise the docs for the collation feature.

Comment by David Golden [ 29/Aug/16 ]

david.storch, after reviewing UTS #10 and UTS #35, I agree with your assessment of the options.

How about this as (verbose) documentation for caseLevel:

When caseLevel is true, collation adds a comparison level consisting only of case characteristics that affects collation differently by strength level:

  • For strength 1, the level is added after level 1 and distinguishes case regardless of accent
  • For strength 2, the level is added after level 2 and distinguishes case after taking accents into account
  • For strength > 2, the level is added after level 2 and distinguishes case before other level 3 characteristics
Comment by David Storch [ 26/Aug/16 ]

Here are the results of my thinking, option by option, about whether or not there are combinations that should be rejected:

  • locale and strength are always legal, so long as the locale is a recognized supported locale, and strength is an integer on the range [1, 5].
  • caseLevel is legal at any strength, per my comment above.
  • caseFirst affects only tertiary level or caseLevel comparisons. Therefore, if strength is 1 or 2 and caseLevel is off, then this option should be rejected.
  • numericOrdering is always legal.
  • alternate controls our handling of variable collation elements. Variable collation elements are those with low, but non-ignorable primary weights. (For details on the terminology I'm using here, see Unicode Technical Standard #10.) Since variable collation elements are always non-ignorable (in particular, non-ignorable at the primary level), this option impacts comparisons being made at any strength.
  • maxVariable determines which collation elements are considered variable collation elements. For the same reason as I described for alternate, this option also impacts comparisons at any strength.
  • normalization is always valid. If the input strings may not be in Unicode NFD (see UTS#15), then this option must be enabled in order to ensure that strings are converted into their canonically decomposed forms. If the input is known to be normalized, then this option can be turned off.
  • backwards, as I mentioned above, is not meaningful at strength 1.
Comment by David Storch [ 26/Aug/16 ]

My reading of the docs is that turning on caseLevel is meaningful at strengths greater than 2. Turning on this flag creates a new comparison level in between the secondary and tertiary levels which considers only case differences. I believe it is possible that some tailoring of the Unicode Collation Algorithm would involve tertiary differences other than case. If you would like case differences to be more significant than these other tertiary differences (or quaternary or higher differences, for that matter), then it would be meaningful to enable this option at strengths 3 or higher. I think the common use case would be to consider case differences at strength 1, but we shouldn't bar other use cases.

That said, I believe there are some combinations of options that we should reject. For example, the backwards option specifically means "backwards secondary weighting". It applies specifically at the secondary level, which means that it is not meaningful at strength 1. I'm going to try to come up with a list of all such combinations that should be rejected.

Comment by David Golden [ 18/Aug/16 ]

I think investigation is warranted. The docs here, here and here strongly imply that it adds case level considerations to the primary or secondary levels, which otherwise wouldn't consider case, without adding the other tertiary level considerations.

From my reading of the docs, caseLevel true and strength 3+ is redundant. Perhaps if we're very careful to document caseLevel similar to UTS#35, then we don't need to worry about contradictory instructions. I.e. saying "caseLevel true turns on case sensitivity" is wrong, but saying "caseLevel true modifies strength level 1 or 2 to include case differences in the collation order" would be correct and nudge users towards correct usage.

Comment by David Storch [ 18/Aug/16 ]

Great, glad we agree on the validation bit We will have to dig into the unicode collation algorithm a bit before we make this change. I currently don't understand it well enough to be certain that caseLevel plus strength > 2 is a meaningless combination for all languages. If it is, then I'm totally in agreement that the server should reject it. Plus, if we don't reject things like this in 3.4, then it will be harder for us to make this change down the line.

Comment by Derick Rethans [ 18/Aug/16 ]

david.storch "the drivers should not issue an error or warning in this case" ← That's what I said. We would like the server to raise errors for incompatible fields.

Comment by David Golden [ 18/Aug/16 ]

We're agreed. Drivers should not do any validation.

In the specific caseLevel example, I think the logic should be that the existence of a "caseLevel" field when strength > 2 should error.

Comment by David Storch [ 18/Aug/16 ]

david.golden derick, the drivers should not issue an error or warning in this case. Please do not introduce driver-side validation of the collation specification. It is reasonable to expect that the server does this validation. Since ICU accepts any combination of options, our integration with ICU is currently happy to accept any combination of options, even if the combination is not meaningful. At some point our team should do a more thorough review of the collation options to decide whether we can confidently say which combinations of options are illegal. That will be the work represented by this ticket.

Comment by David Golden [ 18/Aug/16 ]

I realize that ICU is happy to ignore this combination and just use strength. The argument from the drivers team is that allowing users to express contradictory desires leads to frustrated users and additional customer support questions.

We want it server side so that it can be consistent for all drivers and so that the server team can evolve what is considered contradictory over time if necessary without all drivers needing to be changed.

Generated at Thu Feb 08 04:09:53 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.