I've had a conversation with @Geert about how BSON validation functions in the server.
BSON validation ensures that the BSON is structurally well-formed and capable of being iterated, but it doesn't validate the actual contents of non-structural elements. A minor exception to this is booleans, which are validated to be either true or false. We don't validate that regexes will compile, or that strings are UTF-8 encoded. Effectively, strings are binary blobs of data, which may or may not represent human-readable text in any given character-set encoding. Today, servers will accept strings formed by drawing entropy from /dev/urandom. It's also plausible that strings store human-readable text encoded with Latin-1 or Shift-JIS.
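To make the structural/content distinction concrete, here is a minimal Python sketch (hand-rolled from the BSON spec, not using any driver) that builds a document whose string payload is not valid UTF-8. The document passes the kind of structural checks described above, while content-level validation would reject it:

```python
import struct

def string_element(name: bytes, payload: bytes) -> bytes:
    """Encode a BSON string element (type 0x02). The payload is copied
    verbatim; nothing here checks that it is valid UTF-8."""
    # string ::= int32 (byte*) "\x00" -- the int32 counts the trailing NUL
    return b"\x02" + name + b"\x00" + struct.pack("<i", len(payload) + 1) + payload + b"\x00"

def document(elements: bytes) -> bytes:
    """document ::= int32 e_list "\x00" -- the int32 counts the whole document."""
    return struct.pack("<i", len(elements) + 5) + elements + b"\x00"

# b"\xff\xfe" is not valid UTF-8 (0xff never appears in UTF-8), yet the
# resulting BSON is structurally sound: lengths agree, NULs are in place.
doc = document(string_element(b"s", b"\xff\xfe"))

# A structural check: verify the outer length and the terminator, without
# ever interpreting the string bytes.
(total,) = struct.unpack_from("<i", doc, 0)
assert total == len(doc) and doc[-1] == 0  # well-formed and iterable

# Content-level validation, by contrast, would reject it:
payload = doc[4 + 1 + 2 + 4 : -2]  # skip doc length, type byte, b"s\x00", int32
try:
    payload.decode("utf-8")
except UnicodeDecodeError:
    print("structurally valid BSON, but the string is not UTF-8")
```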
Enforcing UTF-8 validation of strings would break the current behaviour of accepting arbitrary binary input in string fields. Further, sanitizing an existing database that contains non-UTF-8 strings may be impossible, because not all strings contain text. Finally, such enforcement would prevent restoring backups taken on older servers.
Generally speaking, if you put an object into the database and request it again, you should get a byte-for-byte identical representation of the object back. For example, if you insert a field which contains a particular bit-level representation of NaN, when you query that document you will get the same NaN representation back. Character-set re-encoding, performed by an explicit upgrade operation, would violate this property, and would for example change the checksums of stored documents.
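As a small illustration of the NaN point: BSON stores doubles as little-endian IEEE 754, and distinct NaN bit patterns exist that any numeric normalization would collapse. This sketch assumes CPython on an IEEE 754 platform, where struct copies the bits verbatim:

```python
import struct

# Two different IEEE 754 bit patterns that both decode to NaN
# (little-endian, as BSON stores doubles). The second carries a payload bit.
default_qnan = bytes.fromhex("000000000000f87f")  # 0x7ff8000000000000
payload_qnan = bytes.fromhex("010000000000f87f")  # 0x7ff8000000000001

for raw in (default_qnan, payload_qnan):
    (value,) = struct.unpack("<d", raw)
    assert value != value                   # both decode to NaN
    assert struct.pack("<d", value) == raw  # the bits survive the round trip

# Re-encoding that normalized every NaN to the default pattern would change
# the stored bytes -- and hence any checksum taken over the document.
```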
Distinctions between user data and MongoDB command data could be made during command parsing, particularly with TypedCommand, and it is perfectly valid to enforce the properties of fields in particular known command invocations.
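To sketch that idea in language-neutral terms (all names here are hypothetical and illustrative, not the server's actual TypedCommand API): fields the command itself defines are parsed and validated strictly, while user document payloads pass through as opaque, structurally-validated bytes.

```python
from dataclasses import dataclass

@dataclass
class InsertCommand:
    collection: str          # command field: must be valid UTF-8 text
    documents: list[bytes]   # user data: opaque, structurally-validated BSON

def parse_insert(raw_collection: bytes, raw_documents: list[bytes]) -> InsertCommand:
    # Enforce content-level rules only on fields the command owns.
    collection = raw_collection.decode("utf-8")  # raises on non-UTF-8
    if not collection:
        raise ValueError("collection name must be non-empty")
    # User documents are neither re-encoded nor inspected beyond structure,
    # preserving the byte-for-byte round-trip property described above.
    return InsertCommand(collection=collection, documents=raw_documents)
```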