Core Server / SERVER-23028

Add test coverage for invalid UTF-8 in CollatorInterfaceICU

Details

    • Type: Task
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 3.3.4
    • Affects Version/s: None
    • Component/s: Querying
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Sprint: Query 11 (03/14/16), Query 12 (04/04/16)

    Description

      icu::Collator handles invalid UTF-8 using the "first approach" suggested in the Unicode Technical Standard for the Unicode Collation Algorithm (UTS #10). Specifically, each ill-formed subsequence of bytes is collated as if it were the replacement character, code point U+FFFD. This is the behavior that we expect the collator in the MongoDB layer, CollatorInterfaceICU, to provide. It must impose a total order: the ordering of a set of strings with respect to a collator must be transitive (if a <= b and b <= c, then a <= c), antisymmetric (if a <= b and b <= a, then a = b), and total (a <= b or b <= a). These properties must hold for any sequence of bytes, regardless of whether the bytes encode valid UTF-8 text. Furthermore, the ordering of any set of byte sequences must remain consistent across releases so that indices built using an older version of the server can still be handled.
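
The three order properties listed above can be checked mechanically over a finite sample of byte strings. The following is a minimal sketch rather than MongoDB test code: `Compare3`, `isTotalOrderOn`, and `bytewiseCompare` are hypothetical names introduced for illustration, and the bytewise comparator is only a stand-in for a real collator's compare().

```cpp
#include <cassert>     // assert(), used when exercising the checker
#include <functional>
#include <string>
#include <vector>

// Three-way comparison over raw byte strings: negative, zero, or positive.
// Hypothetical stand-in for a collator's compare(); not the MongoDB API.
using Compare3 = std::function<int(const std::string&, const std::string&)>;

// Returns true if `cmp` is transitive, antisymmetric, and total on `samples`.
// Exhaustive over all pairs and triples, so keep the sample set small.
bool isTotalOrderOn(const std::vector<std::string>& samples, const Compare3& cmp) {
    auto lessEq = [&](const std::string& a, const std::string& b) {
        return cmp(a, b) <= 0;
    };
    for (const auto& a : samples) {
        for (const auto& b : samples) {
            // Total: a <= b or b <= a.
            if (!lessEq(a, b) && !lessEq(b, a)) {
                return false;
            }
            // Antisymmetric: a <= b and b <= a implies the two compare equal.
            if (lessEq(a, b) && lessEq(b, a) && (cmp(a, b) != 0 || cmp(b, a) != 0)) {
                return false;
            }
            for (const auto& c : samples) {
                // Transitive: a <= b and b <= c implies a <= c.
                if (lessEq(a, b) && lessEq(b, c) && !lessEq(a, c)) {
                    return false;
                }
            }
        }
    }
    return true;
}

// Stand-in comparator: plain byte order, normalized to -1, 0, or 1.
int bytewiseCompare(const std::string& a, const std::string& b) {
    const int r = a.compare(b);
    return (r > 0) - (r < 0);
}
```

For example, `assert(isTotalOrderOn(samples, bytewiseCompare))` should hold for a sample set that mixes valid text with ill-formed sequences such as "\xFF" or "\xC3\x28", while a comparator that reports every unequal pair as "less" fails the antisymmetry check.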

      The replacement character approach implemented in the ICU 56.1 UTF-8 collation algorithm does provide these total order properties.
      Nonetheless, we should add test coverage for CollatorInterfaceICU's comparisons involving invalid UTF-8, using both CollatorInterfaceICU::getComparisonKey() and CollatorInterfaceICU::compare(). These tests will demonstrate that invalid subsequences are weighted as U+FFFD.
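
To illustrate the property such tests would demonstrate, here is a simplified model of the replacement character approach. It does not call ICU or CollatorInterfaceICU: `decodeWithReplacement` and `compareWithReplacement` are hypothetical helpers, the comparison uses code point order rather than real collation weights, and emitting one U+FFFD per maximal ill-formed prefix is an assumption that may not match ICU's exact segmentation.

```cpp
#include <cassert>   // assert(), used when exercising the helpers
#include <cstddef>
#include <string>

constexpr char32_t kReplacement = 0xFFFD;

// Decodes `bytes` as UTF-8. Each ill-formed subsequence (invalid lead byte,
// stray continuation byte, truncated or out-of-range sequence) becomes U+FFFD.
std::u32string decodeWithReplacement(const std::string& bytes) {
    std::u32string out;
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        ++i;
        if (b0 < 0x80) {  // ASCII
            out.push_back(b0);
            continue;
        }
        // Expected length and allowed range of the second byte, following the
        // well-formed UTF-8 byte sequence table in the Unicode Standard.
        int len = 0;
        unsigned char lo = 0x80, hi = 0xBF;
        char32_t cp = 0;
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2; cp = b0 & 0x1F;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3; cp = b0 & 0x0F;
            if (b0 == 0xE0) lo = 0xA0;  // reject overlong encodings
            if (b0 == 0xED) hi = 0x9F;  // reject surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4; cp = b0 & 0x07;
            if (b0 == 0xF0) lo = 0x90;  // reject overlong encodings
            if (b0 == 0xF4) hi = 0x8F;  // reject values above U+10FFFF
        }
        if (len == 0) {  // invalid lead byte or stray continuation byte
            out.push_back(kReplacement);
            continue;
        }
        int k = 1;
        for (; k < len && i < n; ++k, ++i) {
            const unsigned char b = static_cast<unsigned char>(bytes[i]);
            const unsigned char minB = (k == 1) ? lo : 0x80;
            const unsigned char maxB = (k == 1) ? hi : 0xBF;
            if (b < minB || b > maxB) break;  // offending byte is re-read next
            cp = (cp << 6) | (b & 0x3F);
        }
        out.push_back(k == len ? cp : kReplacement);
    }
    return out;
}

// Compares two byte strings after replacement, by code point order.
// Returns -1, 0, or 1.
int compareWithReplacement(const std::string& a, const std::string& b) {
    const int r = decodeWithReplacement(a).compare(decodeWithReplacement(b));
    return (r > 0) - (r < 0);
}
```

Under this model, a string containing an ill-formed byte such as 0xFF compares equal to the same string with a well-formed U+FFFD (bytes EF BF BD) in its place, which is the kind of equivalence the proposed tests would assert through compare() and getComparisonKey().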

          People

            Assignee: David Storch (david.storch@mongodb.com)
            Reporter: David Storch (david.storch@mongodb.com)
