[SERVER-23028] Add test coverage for invalid UTF-8 in CollatorInterfaceICU Created: 09/Mar/16  Updated: 14/Apr/16  Resolved: 23/Mar/16

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: 3.3.4

Type: Task Priority: Major - P3
Reporter: David Storch Assignee: David Storch
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Backwards Compatibility: Fully Compatible
Sprint: Query 11 (03/14/16), Query 12 (04/04/16)
Participants:

 Description   

icu::Collator handles invalid UTF-8 using the "first approach" suggested in the technical standard for the Unicode Collation Algorithm. Specifically, each ill-formed subsequence of bytes is collated as if it were the replacement character, code point U+FFFD. This is the behavior that we expect the collator in the MongoDB layer, CollatorInterfaceICU, to provide. It must impose a total order: the ordering of a set of strings with respect to a collator must be transitive (if a <= b and b <= c, then a <= c), antisymmetric (if a <= b and b <= a, then a = b), and total (a <= b or b <= a). These properties must hold for any sequence of bytes, regardless of whether the bytes encode valid UTF-8 text. Furthermore, the ordering of any set of byte sequences must remain consistent across releases in order to handle indices built using an older version of the server.
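The three properties above can be checked exhaustively over a small sample of byte strings. The following is a minimal Python sketch, not MongoDB or ICU code: `model_compare` and `check_total_order` are hypothetical names, and the comparison relies on Python's built-in UTF-8 decoder with `errors="replace"` and plain code point order, whereas ICU weights U+FFFD according to its collation tables.

```python
from itertools import product


def model_compare(a: bytes, b: bytes) -> int:
    """Hypothetical stand-in for a collator's compare(); this is not ICU.

    Each ill-formed UTF-8 subsequence is decoded as U+FFFD, then the
    decoded strings are compared by code point. Returns -1, 0, or 1.
    """
    ka = a.decode("utf-8", errors="replace")
    kb = b.decode("utf-8", errors="replace")
    return (ka > kb) - (ka < kb)


def check_total_order(samples, compare) -> bool:
    """Check totality, antisymmetry, and transitivity over a finite sample."""

    def le(x, y):
        return compare(x, y) <= 0

    total = all(le(a, b) or le(b, a) for a, b in product(samples, repeat=2))
    # "a = b" means the collator reports equality, not byte-for-byte equality.
    antisymmetric = all(
        compare(a, b) == 0
        for a, b in product(samples, repeat=2)
        if le(a, b) and le(b, a)
    )
    transitive = all(
        le(a, c)
        for a, b, c in product(samples, repeat=3)
        if le(a, b) and le(b, c)
    )
    return total and antisymmetric and transitive


# Mix of valid text and byte strings that are not valid UTF-8.
samples = [b"", b"abc", b"ab\xff", b"ab\xfe", "ab\ufffd".encode("utf-8"), b"\xc3\x28"]
print(check_total_order(samples, model_compare))
```

Note that under this model distinct invalid inputs such as `b"ab\xff"` and `b"ab\xfe"` compare as equal, so antisymmetry is stated in terms of collator equality rather than identical bytes.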

The total order properties are indeed provided by the replacement character approach implemented in the ICU 56.1 UTF-8 collation algorithm.
Nonetheless, we should add test coverage for CollatorInterfaceICU's comparisons involving invalid UTF-8, using both CollatorInterfaceICU::getComparisonKey() and CollatorInterfaceICU::compare(). These tests will demonstrate that invalid subsequences get weighted as U+FFFD.
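As an illustration of the shape such tests could take, here is a hedged Python sketch that pairs each invalid input with the U+FFFD string it is expected to match, and checks both a comparison key and a direct comparison. It is a model only: `model_comparison_key` and `model_compare` are hypothetical stand-ins for getComparisonKey() and compare(), they use Python's decoder rather than ICU, and the key is simply the UTF-32-BE encoding of the decoded text.

```python
def replace_ill_formed(raw: bytes) -> str:
    """Decode as UTF-8, mapping each ill-formed subsequence to U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def model_comparison_key(raw: bytes) -> bytes:
    """Hypothetical key: fixed-width big-endian code points, so that
    bytewise key order matches code point order."""
    return replace_ill_formed(raw).encode("utf-32-be")


def model_compare(a: bytes, b: bytes) -> int:
    """Hypothetical compare(): code point order after replacement."""
    ka, kb = replace_ill_formed(a), replace_ill_formed(b)
    return (ka > kb) - (ka < kb)


def sign(x: int) -> int:
    return (x > 0) - (x < 0)


# (invalid input, text it is expected to collate as)
CASES = [
    (b"a\xffb", "a\ufffdb"),  # 0xFF never appears in well-formed UTF-8
    (b"\x80", "\ufffd"),      # lone continuation byte
    (b"ab\xc3", "ab\ufffd"),  # two-byte sequence truncated at end of input
]

for invalid, expected in CASES:
    valid = expected.encode("utf-8")
    # Both entry points should treat the invalid bytes like U+FFFD.
    assert model_compare(invalid, valid) == 0
    assert model_comparison_key(invalid) == model_comparison_key(valid)

# compare() and key ordering should agree for every pair of inputs.
inputs = [case[0] for case in CASES] + [b"abc", b""]
for a in inputs:
    for b in inputs:
        ka, kb = model_comparison_key(a), model_comparison_key(b)
        assert model_compare(a, b) == sign((ka > kb) - (ka < kb))
```

The second loop reflects the reason for testing both entry points: a comparison key is only useful if ordering keys bytewise gives the same answer as comparing the original strings.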



 Comments   
Comment by Githook User [ 23/Mar/16 ]

Author: David Storch (dstorch) <david.storch@10gen.com>

Message: SERVER-23028 add test coverage for invalid UTF-8 passed to CollatorInterfaceICU
Branch: master
https://github.com/mongodb/mongo/commit/be5580e4a9c90a32616a8845e94f7a34bcfc0a74

Comment by David Storch [ 10/Mar/16 ]

schwerin, I've filled in the description. Ideally we would be validating UTF-8 strings at a higher layer, but I think we need to define the behavior of the collator in case invalid strings slip through or are present in existing databases. Note that CollatorInterface::compare() is not expected to fail, and right now is expected to have defined behavior for any two StringData objects passed as input.

Comment by Andy Schwerin [ 10/Mar/16 ]

Handle how? Description please.

Generated at Thu Feb 08 04:02:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.