|
icu::Collator handles invalid UTF-8 using the "first approach" suggested in the technical standard for the Unicode Collation Algorithm (UTS #10). Specifically, each ill-formed subsequence of bytes is collated as if it were the replacement character, code point U+FFFD. This is the behavior that we expect the collator in the MongoDB layer, CollatorInterfaceICU, to provide. It must impose a total order: the ordering of a set of strings with respect to a collator must be transitive (if a <= b and b <= c, then a <= c), antisymmetric (if a <= b and b <= a, then a = b), and total (a <= b or b <= a). These properties must hold for any sequence of bytes, regardless of whether those bytes encode valid UTF-8 text. Furthermore, the ordering of any set of byte sequences must remain consistent across releases so that indices built with an older version of the server remain usable.
The total order properties are indeed provided by the replacement character approach implemented in the ICU 56.1 UTF-8 collation algorithm.
Nonetheless, we should add test coverage for CollatorInterfaceICU's comparisons involving invalid UTF-8, using both CollatorInterfaceICU::getComparisonKey() and CollatorInterfaceICU::compare(). These tests will demonstrate that invalid subsequences are weighted as U+FFFD.
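As a rough illustration of what such tests would assert, here is a minimal Python model of the replacement-character approach. It does not call ICU or CollatorInterfaceICU. The names model_key, model_compare, check_order_properties and SAMPLES are hypothetical stand-ins, and the model orders strings by code point rather than by real collation weights. It only shows the shape of the checks: an invalid subsequence should compare equal to an explicit U+FFFD, and the order properties should hold over arbitrary byte sequences.

```python
from itertools import product


def model_key(raw: bytes) -> str:
    """Hypothetical stand-in for a comparison key: decode the bytes as UTF-8,
    substituting U+FFFD for each ill-formed subsequence."""
    return raw.decode("utf-8", errors="replace")


def model_compare(a: bytes, b: bytes) -> int:
    """Hypothetical stand-in for compare(): returns -1, 0, or 1.
    Uses code point order, not real collation weights."""
    ka, kb = model_key(a), model_key(b)
    return (ka > kb) - (ka < kb)


def check_order_properties(samples) -> bool:
    """Exhaustively check totality, antisymmetry (up to key equality),
    and transitivity over the given byte strings."""
    def leq(a, b):
        return model_compare(a, b) <= 0

    for a, b in product(samples, repeat=2):
        assert leq(a, b) or leq(b, a)            # total
        if leq(a, b) and leq(b, a):
            assert model_key(a) == model_key(b)  # antisymmetric
    for a, b, c in product(samples, repeat=3):
        if leq(a, b) and leq(b, c):
            assert leq(a, c)                     # transitive
    return True


# A mix of valid strings, a lone invalid byte, an explicit U+FFFD,
# a lead byte followed by a non-continuation byte, and a truncated sequence.
SAMPLES = [b"", b"abc", b"abd", b"ab\xff", b"ab\xef\xbf\xbd", b"\xc3(", b"\xe2\x82"]

if __name__ == "__main__":
    # The invalid byte 0xFF is treated like an explicit U+FFFD.
    assert model_key(b"ab\xff") == model_key("ab\ufffd".encode("utf-8"))
    assert model_compare(b"ab\xff", b"ab\xef\xbf\xbd") == 0
    assert check_order_properties(SAMPLES)
```

Note that in this model the antisymmetry check holds up to key equality: b"ab\xff" and the UTF-8 encoding of "ab" followed by U+FFFD are different byte sequences that compare equal. The real tests would presumably make the same assertions against getComparisonKey() and compare().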
|