[SERVER-25683] Collation support does not reject invalid combinations of fields Created: 18/Aug/16 Updated: 19/Nov/16 Resolved: 30/Aug/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | 3.3.11 |
| Fix Version/s: | 3.3.14 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Derick Rethans | Assignee: | David Storch |
| Resolution: | Done | Votes: | 0 |
| Labels: | collation | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Query 2016-09-19 |
| Participants: |
| Description |
|
When creation an index with a locale, the server rejects things like this:
But it does not seem to reject the following:
A caseLevel definition only make sense for strength 1 or 2, as per "Optional fields" in the spec, with other levels it would just get ignored. When reviewing the drivers specification, david.golden had flagged this. The consensus on the drivers side seemed to be that we should warn users about this "ignored field", but that this should be the responsibility of the server to throw an exception for – just like we rely on the server to that for readConcerns: https://github.com/mongodb/specifications/blob/master/source/read-write-concern/read-write-concern.rst#unknown-levels-and-additional-options-for-string-based-readconcerns |
| Comments |
| Comment by Githook User [ 30/Aug/16 ] |
|
Author: {u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}Message: |
| Comment by David Storch [ 29/Aug/16 ] |
|
david.golden, LGTM! We should definitely incorporate something like this as we continue to revise the docs for the collation feature. |
| Comment by David Golden [ 29/Aug/16 ] |
|
david.storch, after reviewing UTS #10 and UTS #35, I agree with your assessment of the options. How about this as (verbose) documentation for caseLevel: When caseLevel is true, collation adds a comparison level consisting only of case characteristics that affects collation differently by strength level:
|
| Comment by David Storch [ 26/Aug/16 ] |
|
Here are the results of my thinking, option by option, about whether or not there are combinations that should be rejected:
|
| Comment by David Storch [ 26/Aug/16 ] |
|
My reading of the docs is that turning on caseLevel is meaningful at strengths greater than 2. Turning on this flag creates a new comparison level in between the secondary and tertiary levels which considers only case differences. I believe it is possible that some tailoring of the Unicode Collation Algorithm would involve tertiary differences other than case. If you would like case differences to be more significant than these other tertiary differences (or quaternary or higher differences, for that matter), then it would be meaningful to enable this option at strengths 3 or higher. I think the common use case would be to consider case differences at strength 1, but we shouldn't bar other use cases. That said, I believe there are some combinations of options that we should reject. For example, the backwards option specifically means "backwards secondary weighting". It applies specifically at the secondary level, which means that it is not meaningful at strength 1. I'm going to try to come up with a list of all such combinations that should be rejected. |
| Comment by David Golden [ 18/Aug/16 ] |
|
I think investigation is warranted. The docs here, here and here strongly imply that it adds case level considerations to the primary or secondary levels, which otherwise wouldn't consider case, without adding the other tertiary level considerations. From my reading of the docs, caseLevel true and strength 3+ is redundant. Perhaps if we're very careful to document caseLevel similar to UTS#35, then we don't need to worry about contradictory instructions. I.e. saying "caseLevel true turns on case sensitivity" is wrong, but saying "caseLevel true modifies strength level 1 or 2 to include case differences in the collation order" would be correct and nudge users towards correct usage. |
| Comment by David Storch [ 18/Aug/16 ] |
|
Great, glad we agree on the validation bit |
| Comment by Derick Rethans [ 18/Aug/16 ] |
|
david.storch "the drivers should not issue an error or warning in this case" ← That's what I said. We would like the server to raise errors for incompatible fields. |
| Comment by David Golden [ 18/Aug/16 ] |
|
We're agreed. Drivers should not do any validation. In the specific caseLevel example, I think the logic should be that the existence of a "caseLevel" field when strength > 2 should error. |
| Comment by David Storch [ 18/Aug/16 ] |
|
david.golden derick, the drivers should not issue an error or warning in this case. Please do not introduce driver-side validation of the collation specification. It is reasonable to expect that the server does this validation. Since ICU accepts any combination of options, our integration with ICU is currently happy to accept any combination of options, even if the combination is not meaningful. At some point our team should do a more thorough review of the collation options to decide whether we can confidently say which combinations of options are illegal. That will be the work represented by this ticket. |
| Comment by David Golden [ 18/Aug/16 ] |
|
I realize that ICU is happy to ignore this combination and just use strength. The argument from the drivers team is that allowing users to express contradictory desires leads to frustrated users and additional customer support questions. We want it server side so that it can be consistent for all drivers and so that the server team can evolve what is considered contradictory over time if necessary without all drivers needing to be changed. |