- Type: Bug
- Resolution: Won't Fix
- Priority: Minor - P4
- Affects Version/s: None
- Component/s: bsondump
- Environment: All
When bsondump displays UTF-8 strings, it escapes certain characters (quotes, backslashes, tabs, some others) but otherwise acts as if the UTF-8 is ready for display on the screen. If the string contains invalid UTF-8, it is up to the display driver, terminal program, or some other downstream component to decide how to show the bad data. This means that running bsondump on the same data on two different systems could produce different-looking output. It would be better if bsondump "fixed up" invalid UTF-8 by replacing bad sequences with the Unicode replacement character (U+FFFD), so that the output would be similar no matter what the display driver does with bad UTF-8.
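To illustrate the kind of fix-up being requested, here is a minimal sketch in Python. It is not bsondump's code; the function name `sanitize_utf8` is hypothetical, and it only shows the general technique of substituting U+FFFD for byte sequences that do not decode as UTF-8.

```python
def sanitize_utf8(raw: bytes) -> str:
    """Decode raw as UTF-8, substituting U+FFFD for invalid sequences.

    Hypothetical helper for illustration only; not part of bsondump.
    """
    # errors="replace" makes the decoder emit U+FFFD instead of raising
    # UnicodeDecodeError when it hits bytes that are not valid UTF-8.
    return raw.decode("utf-8", errors="replace")


# "café" encoded correctly, followed by a stray 0xFF byte, which can
# never appear in well-formed UTF-8.
sample = b"caf\xc3\xa9 \xff ok"
print(sanitize_utf8(sample))
```

With a fix-up like this, every consumer of the output sees the same replacement character rather than whatever its terminal or display driver happens to do with the raw bytes.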
bsondump can't really flag the data in the output stream without possibly messing up the JSON display, which might be intended for some post-processing filter. If passing bad UTF-8 to stdout as-is is a feature for some purposes, maybe we should add a --raw option to the command line to make that happen.
Somewhat related, it would be nice if the "--type debug" option displayed something more informative than "bad utf8 String!". The debug mode of bsondump only displays structure and sizes except for this UTF-8 checking: as long as it's checking and displaying an error, maybe it could show the offending bytes in hex and show their location in the string. This might make bsondump more useful as a debugging aid.
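As a rough sketch of what a more informative debug message could be built from, the following Python locates each invalid sequence and reports its byte offset and hex value. The function name `find_invalid_utf8` and the report format are assumptions made for illustration, not anything bsondump provides.

```python
def find_invalid_utf8(raw: bytes):
    """Return a list of (offset, hex_bytes), one per invalid UTF-8 sequence.

    Hypothetical helper for illustration only; not part of bsondump.
    """
    problems = []
    pos = 0
    while pos < len(raw):
        try:
            # Decode the remainder; success means no further bad bytes.
            raw[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end index into the slice that was decoded,
            # so shift them by pos to get offsets into the original string.
            start = pos + err.start
            end = pos + err.end
            problems.append((start, raw[start:end].hex()))
            pos = end  # resume scanning just past the bad sequence
    return problems


for offset, hex_bytes in find_invalid_utf8(b"ab\xffcd\x80"):
    print(f"bad utf8 at byte offset {offset}: 0x{hex_bytes}")
```

Re-decoding the tail after every error keeps the sketch short at the cost of quadratic work on badly damaged strings; a real implementation would more likely validate in a single pass.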