[SERVER-12204] Buffer::readUTF8String in bson_validate.cpp should validate utf8 Created: 24/Dec/13  Updated: 22/Sep/20  Resolved: 17/Mar/20

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Mathias Stearn Assignee: Spencer Jackson
Resolution: Won't Do Votes: 1
Labels: platforms-re-triaged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
documents SERVER-27426 validateBSON should assert that Array... Closed
Duplicate
duplicates SERVER-21753 Improve UTF-8 validation of BSON docu... Closed
Related
related to SERVER-51083 Problem with regex index bounds Closed
related to SERVER-29466 Cannot query over fields containing i... Closed
Sprint: Security 2020-02-24, Security 2020-03-09, Security 2020-03-23
Participants:

 Description   

Currently it does not. readCString probably should too, but at least UTF8 isn't part of its name.

Example failing test:

    TEST(BSONValidateFast, InvalidUTF8) {
        BSONObj x = BSON("invalidUTF8" << "\xFF"); // this byte should never be in well-formed utf8.
        ASSERT_NOT_OK(validateBSON(x.objdata(), x.objsize()));
    }
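
For reference, the check this ticket asks for could look roughly like the sketch below. `isWellFormedUTF8` is a hypothetical standalone helper, not an existing function in bson_validate.cpp, and it assumes the RFC 3629 rules: no overlong encodings, no surrogates, nothing above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of bson_validate.cpp): returns true if the
// byte range [data, data + len) is well-formed UTF-8 per RFC 3629.
bool isWellFormedUTF8(const char* data, std::size_t len) {
    const auto* p = reinterpret_cast<const unsigned char*>(data);
    std::size_t i = 0;
    while (i < len) {
        const unsigned char lead = p[i];
        if (lead < 0x80) {  // single-byte (ASCII) code point
            ++i;
            continue;
        }

        std::size_t extra;      // continuation bytes expected after the lead byte
        std::uint32_t cp;       // code point being decoded
        std::uint32_t minimum;  // smallest code point allowed at this length
        if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; minimum = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; minimum = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; minimum = 0x10000;
        } else {
            return false;  // stray continuation byte, or 0xF8..0xFF
        }

        if (len - i <= extra) return false;  // sequence runs past the end

        for (std::size_t j = 1; j <= extra; ++j) {
            const unsigned char c = p[i + j];
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < minimum) return false;                  // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate

        i += extra + 1;
    }
    return true;
}
```

If a check along these lines were applied to the string in the test above, the single byte "\xFF" would be rejected because it is not a valid lead byte.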



 Comments   
Comment by Spencer Jackson [ 17/Mar/20 ]

I've had a conversation with @Geert about how BSON validation functions in the server.
BSON validation ensures that the BSON is structurally well-formed, and is capable of being iterated, but doesn't validate the actual contents of non-structural elements. A minor exception to this is booleans, which are validated to be either true or false. We don't validate that regexes will compile, or that strings are UTF-8 encoded. Effectively, strings are binary blobs of data, which may or may not represent human readable text in any given character set encoding. Today, servers will accept strings formed by drawing entropy from /dev/urandom. It's also plausible that strings can store human readable text encoded with Latin-1 or Shift-JIS.
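
To make the structural-versus-content distinction concrete, here is a minimal sketch of a purely structural check on a BSON string value. It assumes the framing from the BSON spec: a little-endian int32 length that counts the trailing NUL, followed by that many bytes. `stringValueIsStructurallyValid` is a hypothetical name used for illustration, not the server's actual validation code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: checks only the framing of a BSON string value that
// starts at buf, with `avail` bytes remaining in the enclosing document.
// The payload bytes themselves are never inspected.
bool stringValueIsStructurallyValid(const unsigned char* buf, std::size_t avail) {
    if (avail < 4) return false;  // no room for the int32 length prefix

    // Decode the little-endian length prefix.
    const std::uint32_t len = static_cast<std::uint32_t>(buf[0]) |
                              (static_cast<std::uint32_t>(buf[1]) << 8) |
                              (static_cast<std::uint32_t>(buf[2]) << 16) |
                              (static_cast<std::uint32_t>(buf[3]) << 24);

    if (len < 1 || len > 0x7FFFFFFFu) return false;  // must hold at least the NUL
    if (len > avail - 4) return false;               // would run past the buffer

    return buf[4 + len - 1] == 0;  // last counted byte must be the NUL terminator
}
```

Under this kind of check, a payload such as the single byte 0xFF passes as long as the length and terminator are right, which matches the "binary blob" behaviour described above.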

Enforcing UTF-8 validation of strings breaks the current behaviour of being able to save any binary input. Further, sanitizing a database which contains non-UTF-8 strings may be impossible because not all strings contain text. Finally, extra enforcement would prevent restoration of backups taken on older servers.

Generally speaking, if you put an object into the database, and request it again, you should get a byte-for-byte identical representation of the object. For example, if you insert a field which contains a particular representation of NaN, when you query that document, you will get the same NaN representation back. Character set re-encoding, performed by an explicit upgrade operation, would violate this property and would, for example, change the hash sums of stored documents.
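
The NaN case can be illustrated with a small sketch. It assumes `double` is an IEEE 754 binary64 value, and `fromBits` and `toBits` are hypothetical helpers rather than server code. Two bit patterns that differ only in their NaN payload both test as NaN, yet they are different bytes, so storage that preserves documents byte-for-byte has to keep them distinct.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "sketch assumes a 64-bit double");

// Hypothetical helper: reinterpret a raw 64-bit pattern as a double.
double fromBits(std::uint64_t bits) {
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// Hypothetical helper: recover the raw 64-bit pattern of a double.
std::uint64_t toBits(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}

// Two quiet NaNs that differ only in the lowest payload bit.
const std::uint64_t kQuietNaN = 0x7FF8000000000000ULL;
const std::uint64_t kQuietNaNWithPayload = 0x7FF8000000000001ULL;

// Both values are NaN, but their stored representations differ.
bool bothNaNButDistinct() {
    const double a = fromBits(kQuietNaN);
    const double b = fromBits(kQuietNaNWithPayload);
    return std::isnan(a) && std::isnan(b) && toBits(a) != toBits(b);
}
```

Any normalization step that collapsed these two values into one canonical NaN would change the stored bytes, which is the same kind of change a character set re-encoding would make to strings.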

Distinctions between user data and MongoDB command data could be made during command parse, particularly with TypedCommand, and it is perfectly valid to enforce the properties of fields in particular known command invocations.
