Core Server / SERVER-95394

Need significant changes to validUTF8 and escape functions

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Server Programmability
    • ALL
    • SP Prioritized List

      Essentially all of the functions in str_escape.cpp are built on a broken foundation: the helper function escape(). The biggest problem is that it only handles 1- and 2-byte UTF-8 sequences (up to U+07FF). 3- and 4-byte sequences are validated, but then dropped on the floor rather than being output. Note that 4-byte sequences are outside the BMP, so they will need to be encoded as surrogate pairs in escaped JSON (or simply emitted to the output unescaped after validation). It also has some issues with not correctly validating UTF-8, but SERVER-75942 already has a PR in progress to fix that.
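
      The surrogate-pair encoding mentioned above is mechanical once the codepoint is decoded. A minimal sketch (the function name is illustrative, not from str_escape.cpp):

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <cstdio>
      #include <string>

      // Hypothetical helper: escape one codepoint above the BMP
      // (U+10000..U+10FFFF) as a JSON \uXXXX\uXXXX surrogate pair.
      std::string escapeSupplementary(uint32_t cp) {
          assert(cp >= 0x10000 && cp <= 0x10FFFF);
          uint32_t v = cp - 0x10000;         // 20 payload bits
          uint16_t hi = 0xD800 + (v >> 10);  // high surrogate: top 10 bits
          uint16_t lo = 0xDC00 + (v & 0x3FF);  // low surrogate: bottom 10 bits
          char buf[13];
          std::snprintf(buf, sizeof(buf), "\\u%04X\\u%04X", hi, lo);
          return buf;
      }

      int main() {
          // U+1F600 (an emoji, outside the BMP)
          std::printf("%s\n", escapeSupplementary(0x1F600).c_str());
      }
      ```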

      We should also take care of some performance issues while we are in there. Since we are already using this during BSON validation (edit: we are!), improving it will speed up something on the critical path of every command. We may even be able to make it fast enough that we can validate all strings.

      • Consider separating the logic for validation from escaping. Validation is likely to be used in much more performance sensitive situations, and it is easier to optimize it if we don't have the escaping logic mixed in.
      • We should try to use SIMD as much as possible for validation and escaping. At the very least, we can process (read, check, and optionally write) up to 16 bytes of ASCII at a time that way. The writer function can efficiently check for any byte `< 0x20 || == 0x7f` to find the first non-printable character (if any) in the SIMD register. We already have ByteVector for this exact purpose (as used here). We could also consider some of the tricks from simdjson to handle 2-byte sequences this way as well, but that is considerably more complicated.
        • Since most field names in BSON are short, we should consider having validUTF8 take an extra parameter representing the end of the containing buffer, allowing it to read up to the end of the buffer rather than the end of the string. That would allow it to handle most fields with a single SIMD load.
        • Consider a variant of validUTF8 that also looks for the 0x00 byte, so that we can use a single function (and a single SIMD load) to both check whether a field is valid and find out how many bytes it has, rather than using a separate call to strnlen.
      • We should use `std::countl_ones(leadingByte)` to decide which class we are in, possibly after an `if (MONGO_likely(leadingByte <= 0x7f))` special case for ASCII (which is also useful on x86 without the LZCNT instruction, since BSR needs a branch before it anyway if the input could be zero).
      • Since we are already decoding each non-ASCII codepoint to check for overlong encodings, we should pass the decoded codepoint to the writer function so that it doesn't need to do the same work again.
      • Alternatively, we could skip decoding the codepoint as part of validation and just check that each byte is in the valid range for its position in the sequence.
      • We should consider renaming that file to something with utf8 in the name to be more discoverable. We should also rename that function to isValidUtf8.
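
      The 16-bytes-at-a-time scan from the SIMD bullet above could look roughly like this, written with raw SSE2 intrinsics rather than the server's ByteVector wrapper (the function name and exact escape set are illustrative):

      ```cpp
      #include <cstddef>
      #include <cstdint>
      #include <cstdio>
      #include <emmintrin.h>  // SSE2

      // Returns the index of the first byte that is non-ASCII or needs
      // escaping in JSON (b < 0x20, '"', '\\', or 0x7f), or len if none.
      size_t firstByteNeedingEscape(const char* s, size_t len) {
          size_t i = 0;
          for (; i + 16 <= len; i += 16) {
              __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + i));
              // Signed compare: bytes >= 0x80 are negative, so they also trip
              // the "< 0x20" test, conveniently stopping the scan at non-ASCII.
              __m128i bad = _mm_cmplt_epi8(v, _mm_set1_epi8(0x20));
              bad = _mm_or_si128(bad, _mm_cmpeq_epi8(v, _mm_set1_epi8('"')));
              bad = _mm_or_si128(bad, _mm_cmpeq_epi8(v, _mm_set1_epi8('\\')));
              bad = _mm_or_si128(bad, _mm_cmpeq_epi8(v, _mm_set1_epi8(0x7f)));
              int mask = _mm_movemask_epi8(bad);  // one bit per flagged byte
              if (mask)
                  return i + __builtin_ctz(mask);
          }
          for (; i < len; ++i) {  // scalar tail for the last < 16 bytes
              unsigned char b = s[i];
              if (b < 0x20 || b == '"' || b == '\\' || b == 0x7f || b >= 0x80)
                  return i;
          }
          return len;
      }

      int main() {
          std::printf("%zu\n", firstByteNeedingEscape("abc\"def", 7));
      }
      ```

      The writer can copy the clean prefix verbatim and only fall into the slow per-character path at the returned index.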
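
      Classifying the lead byte with `std::countl_ones` (C++20), with the ASCII fast path hoisted out as suggested above, might look like this sketch (using a plain `if` where real code would use MONGO_likely):

      ```cpp
      #include <bit>  // std::countl_ones (C++20)
      #include <cstdint>
      #include <cstdio>

      // Returns the expected UTF-8 sequence length (1-4) for a lead byte,
      // or 0 for an invalid one (a continuation byte 10xxxxxx, or 11111xxx).
      int sequenceLength(uint8_t leadingByte) {
          if (leadingByte <= 0x7f)  // ASCII fast path
              return 1;
          int ones = std::countl_ones(leadingByte);
          // ones == 1 -> continuation byte; ones > 4 -> 0xF8..0xFF, invalid
          return (ones >= 2 && ones <= 4) ? ones : 0;
      }

      int main() {
          std::printf("%d\n", sequenceLength(0xE2));  // 3-byte lead
      }
      ```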
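
      And the combined validate-plus-find-NUL variant could have a shape like the following scalar sketch (hypothetical name; overlong and surrogate checks are omitted for brevity, and a real version needs them):

      ```cpp
      #include <cstddef>
      #include <cstdint>
      #include <cstdio>

      // Scan a NUL-terminated string inside a larger buffer, validating UTF-8
      // structure while looking for the 0x00 terminator, so one pass yields
      // both validity and length (no separate strnlen). Returns the length,
      // or -1 if the data is malformed or unterminated before bufEnd.
      ptrdiff_t validUtf8Len(const char* s, const char* bufEnd) {
          const char* p = s;
          while (p < bufEnd) {
              uint8_t b = *p;
              if (b == 0x00)
                  return p - s;           // found the terminator
              int len;                    // expected sequence length
              if (b < 0x80)       len = 1;
              else if (b < 0xC0)  return -1;  // stray continuation byte
              else if (b < 0xE0)  len = 2;
              else if (b < 0xF0)  len = 3;
              else if (b < 0xF8)  len = 4;
              else                return -1;
              if (bufEnd - p < len)
                  return -1;              // sequence runs past the buffer
              for (int i = 1; i < len; ++i)
                  if ((uint8_t(p[i]) & 0xC0) != 0x80)
                      return -1;          // not a continuation byte
              p += len;
          }
          return -1;  // no NUL before the end of the buffer
      }

      int main() {
          const char field[] = "caf\xC3\xA9";  // "café" plus NUL
          std::printf("%td\n", validUtf8Len(field, field + sizeof(field)));
      }
      ```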

            Assignee: Unassigned
            Reporter: Mathias Stearn (mathias@mongodb.com)
            Votes: 0
            Watchers: 9
