[COMPASS-4944] Fail gracefully instead of "Invalid UTF-8 string in BSON document" Created: 13/Jul/21  Updated: 03/Oct/23

Status: Open
Project: Compass
Component/s: CRUD, Document Validation, Import/Export
Affects Version/s: 1.27.1, 1.34.2
Fix Version/s: 1.32.6

Type: Bug Priority: Major - P3
Reporter: Jake Strang Assignee: Unassigned
Resolution: Unresolved Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Windows for sure, but I think all.


Attachments: PNG File image (1).png    
Issue Links:
Documented
Related
related to NODE-3784 Expose disabling utf-8 validation opt... Closed
Story Points: 2
Documentation Changes: Needed
Documentation Changes Summary:

Not sure where this would go, but this is an infrequent issue that comes up. Sometimes, somehow (almost certainly due to a driver bug or maybe someone using the wire protocol directly?) some bad utf8 can make it into a document. This is because the server doesn't validate utf8 and relies on the drivers to do that.

As explained in the comments on this ticket, the node driver supports a connection url flag for disabling utf8 validation. This is not just for getting around these situations where you have bad data in the database but also because utf8 validation carries a small performance penalty.

So you can just stick that param in a connection string and mongosh/Compass (or whatever uses a driver that supports it) should disable utf8 validation. That gives you the slight performance increase, and it means that you can then see the broken documents in their broken state.

With this PR, Compass now also exposes this option on the Advanced tab of the Advanced Connection Options. The URI Options' "Select key" dropdown now has a new option under Miscellaneous Configuration, "enableUtf8Validation". To use it, select it and set the value to false (it defaults to true).

This kind of situation does happen from time to time and this workaround should probably be documented somewhere.

Sprint: Iteration Fish, Iteration Grouper

 Description   

Problem Statement/Rationale

Compass will not display any results that would include a document containing an invalid UTF-8 string, and in place of the results displays the error "Invalid UTF-8 string in BSON document". This is also true of exporting: Compass will not allow a set of documents to be exported if one of them contains an invalid UTF-8 string (provides the same error).

This is in contrast to Compass v1.26.1 which did display/export these documents, but substituted the replacement character � for any invalid bytes.
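The difference between the two behaviors can be illustrated outside Compass. The following is a minimal sketch using the standard `TextDecoder` API, not the driver's or the BSON library's actual code; the function names `decodeLenient` and `tryDecodeStrict` are hypothetical and exist only for this illustration.

```typescript
// "ab" followed by 0xFF, a byte that can never appear in valid UTF-8.
const invalidBytes = new Uint8Array([0x61, 0x62, 0xff]);

// Lenient decoding (comparable to what v1.26.1 showed):
// each invalid sequence is replaced with U+FFFD.
function decodeLenient(input: Uint8Array): string {
  return new TextDecoder("utf-8", { fatal: false }).decode(input);
}

// Strict decoding (comparable to the newer behavior):
// decoding fails as soon as an invalid sequence is found.
// Returns null instead of throwing so callers can decide how to react.
function tryDecodeStrict(input: Uint8Array): string | null {
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(input);
  } catch {
    return null;
  }
}

console.log(decodeLenient(invalidBytes));   // "ab" plus one replacement character
console.log(tryDecodeStrict(invalidBytes)); // null
```

A client that fails gracefully could try the strict path first and fall back to the lenient one for the affected field only, rather than rejecting the whole result set.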

Steps to Reproduce

1. Create a document in MongoDB that contains a string field with invalid UTF-8 bytes. (I do not know how to actually perform this step but it seems to be possible).

2. View the document in Compass, and also attempt to export the collection that contains this document.

Expected Results

I expect the behavior to be the same as it is in v1.26.1. The document can be viewed in Compass, with invalid chars replaced by �. The document can be exported using the "Export Collection" tool.

Actual Results

The document is not viewable in Compass and displays the error "Invalid UTF-8 string in BSON document". Clicking "Export Collection" (even if not viewing the document at the time, but exporting the full collection) and saving to a file gives the same error: "Invalid UTF-8 string in BSON document".

Additional Notes

The errors occurred on both v1.27.1 and v1.28.1.



 Comments   
Comment by Githook User [ 28/Sep/22 ]

Author:

{'name': 'Le Roux Bodenstein', 'email': 'lerouxb@gmail.com', 'username': 'lerouxb'}

Message: feat(connection-form): Add enableUtf8Validation option to advanced url options COMPASS-4944 (#3284)

Add enableUtf8Validation option to advanced url options
Branch: update-compass-shell-to-shared-config
https://github.com/mongodb-js/compass/commit/06b12f867dc64bba51eb97d8cc3766d8e7b4d13c

Comment by Githook User [ 30/Aug/22 ]

Author:

{'name': 'Le Roux Bodenstein', 'email': 'lerouxb@gmail.com', 'username': 'lerouxb'}

Message: feat(connection-form): Add enableUtf8Validation option to advanced url options COMPASS-4944 (#3284)

Add enableUtf8Validation option to advanced url options
Branch: compass-settings
https://github.com/mongodb-js/compass/commit/06b12f867dc64bba51eb97d8cc3766d8e7b4d13c

Comment by Githook User [ 29/Jul/22 ]

Author:

{'name': 'Le Roux Bodenstein', 'email': 'lerouxb@gmail.com', 'username': 'lerouxb'}

Message: feat(connection-form): Add enableUtf8Validation option to advanced url options COMPASS-4944 (#3284)

Add enableUtf8Validation option to advanced url options
Branch: remove-rc-from-evergreen
https://github.com/mongodb-js/compass/commit/06b12f867dc64bba51eb97d8cc3766d8e7b4d13c

Comment by Githook User [ 27/Jul/22 ]

Author:

{'name': 'Le Roux Bodenstein', 'email': 'lerouxb@gmail.com', 'username': 'lerouxb'}

Message: feat(connection-form): Add enableUtf8Validation option to advanced url options COMPASS-4944 (#3284)

Add enableUtf8Validation option to advanced url options
Branch: macos-arm-build
https://github.com/mongodb-js/compass/commit/06b12f867dc64bba51eb97d8cc3766d8e7b4d13c

Comment by Githook User [ 26/Jul/22 ]

Author:

{'name': 'Le Roux Bodenstein', 'email': 'lerouxb@gmail.com', 'username': 'lerouxb'}

Message: feat(connection-form): Add enableUtf8Validation option to advanced url options COMPASS-4944 (#3284)

Add enableUtf8Validation option to advanced url options
Branch: COMPASS-5678-query-history-as-popover
https://github.com/mongodb-js/compass/commit/06b12f867dc64bba51eb97d8cc3766d8e7b4d13c

Comment by Githook User [ 26/Jul/22 ]

Author:

{'name': 'Le Roux Bodenstein', 'email': 'lerouxb@gmail.com', 'username': 'lerouxb'}

Message: feat(connection-form): Add enableUtf8Validation option to advanced url options COMPASS-4944 (#3284)

Add enableUtf8Validation option to advanced url options
Branch: COMPASS-5672-update-crud-toolbar-to-lg
https://github.com/mongodb-js/compass/commit/06b12f867dc64bba51eb97d8cc3766d8e7b4d13c

Comment by Githook User [ 26/Jul/22 ]

Author:

{'name': 'Le Roux Bodenstein', 'email': 'lerouxb@gmail.com', 'username': 'lerouxb'}

Message: feat(connection-form): Add enableUtf8Validation option to advanced url options COMPASS-4944 (#3284)

Add enableUtf8Validation option to advanced url options
Branch: remote
https://github.com/mongodb-js/compass/commit/06b12f867dc64bba51eb97d8cc3766d8e7b4d13c

Comment by Githook User [ 26/Jul/22 ]

Author:

{'name': 'Le Roux Bodenstein', 'email': 'lerouxb@gmail.com', 'username': 'lerouxb'}

Message: feat(connection-form): Add enableUtf8Validation option to advanced url options COMPASS-4944 (#3284)

Add enableUtf8Validation option to advanced url options
Branch: main
https://github.com/mongodb-js/compass/commit/06b12f867dc64bba51eb97d8cc3766d8e7b4d13c

Comment by Le Roux Bodenstein [ 25/Jul/22 ]

We're planning on exposing the option in the Advanced tab on the connection screen, but two relatively recent changes already made it possible to work around this:

The driver now has an option enableUtf8Validation which defaults to true. You can pass ?enableUtf8Validation=false to a connection string which disables the validation and then you won't see the error.

Since we redid the connection form, Compass won't strip the option from connection URIs, so you can edit the connection string manually and add the option there.

I just tested this locally by hacking the driver to allow me to insert bad UTF8 and confirmed that I get the same error when trying to view that document. This was on mongodb://localhost:27017. Then I connected with mongodb://127.0.0.1:27017/?enableUtf8Validation=false and I can see the broken document rather than the error.

 

FYI mongosh has the same behaviour:

 

% mongosh "mongodb://127.0.0.1:27017/"
.....
> db.test.find()
BSONError: Invalid UTF-8 string in BSON document 

% mongosh "mongodb://127.0.0.1:27017/?enableUtf8Validation=false"
.....
> db.test.find()
[
  {
    _id: ObjectId("62de7aaa15cab26cc822bb15"),
    specialKeyWithBadUtf8: '�'
  }
] 
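For anyone scripting this workaround, here is a rough sketch of adding the option to an existing connection string. The helper name `disableUtf8Validation` is hypothetical (it is not part of the driver or Compass), and the sketch assumes a well-formed `mongodb://` or `mongodb+srv://` URI with percent-encoded credentials.

```typescript
// Returns a copy of the connection string with enableUtf8Validation=false set.
// Plain string handling is used because multi-host URIs such as
// mongodb://a:27017,b:27017/ do not parse with the generic URL class.
function disableUtf8Validation(uri: string): string {
  const option = "enableUtf8Validation=false";
  const queryStart = uri.indexOf("?");

  if (queryStart === -1) {
    // No query string yet: make sure a "/" separates the hosts from the options.
    const afterScheme = uri.indexOf("://") + 3;
    const hasPath = uri.indexOf("/", afterScheme) !== -1;
    return `${uri}${hasPath ? "" : "/"}?${option}`;
  }

  // Keep existing options, dropping any previous enableUtf8Validation value
  // (option names are matched case-insensitively).
  const base = uri.slice(0, queryStart);
  const params = uri
    .slice(queryStart + 1)
    .split("&")
    .filter((p) => p !== "" && !/^enableUtf8Validation=/i.test(p));
  params.push(option);
  return `${base}?${params.join("&")}`;
}

console.log(disableUtf8Validation("mongodb://127.0.0.1:27017/"));
// mongodb://127.0.0.1:27017/?enableUtf8Validation=false
```

The resulting string can be pasted into the Compass connection form or passed to mongosh as shown above.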

 

Comment by Patrick Bennett [ 20/Jun/22 ]

I would consider this critical and this issue needs addressed.  You have data unable to be read simply because it's not valid Unicode - yet, clearly it's being stored as just bytes.  We have to be able to access our data.  

Comment by Anna Henningsen [ 20/Jun/22 ]

patrick@txnlab.dev Yes, this is something that should not be happening in the first place. I could imagine that some drivers for languages in which the native "string" type is a sequence-of-bytes type rather than sequence-of-characters type (e.g. C/C++) don’t perform this extra validation step on insert.

Comment by Patrick Bennett [ 19/Jun/22 ]

Shouldn't the drivers prevent invalid data from being inserted in the first place?  I'm now suddenly seeing this same behavior and frankly, find it kind of unacceptable.  I can't even 'find' the bad record.

Comment by Massimiliano Marcon [ 14/Jul/21 ]

jake@convictional.com as you suspected, Compass 1.27+ uses the most recent versions of the Node.js driver and BSON library, which tend to be more spec compliant than earlier versions of the same libraries.

There is unfortunately not much we can do to force the driver into having the old behavior and we can't downgrade the driver as earlier versions would not work well with the new MongoDB 5.0 and with Atlas Serverless.

I am not sure how a non-UTF-8 string ended up being stored in MongoDB. The workaround atm is to use a version of Compass pre-1.26 to export the data if that works well as you mentioned.

Comment by Jake Strang [ 14/Jul/21 ]

Hi @Massimiliano Marcon, thanks for your reply.

"The BSON spec says strings have to contain valid UTF-8 characters. So Compass (and the underlying driver) work as designed."

The BSON spec determines what makes a string valid but it doesn't prescribe how to handle invalid documents, so I don't think it's accurate to say this behavior follows naturally from the BSON spec.

It may be that this works as designed, but the design must have changed between v1.26.1 and v1.27.1. Since there's no mention of the change in the release notes it seems like it's probably an unintentional side effect (updated driver?) not a design change. Even if it is a design change then I don't think it's a good design change: a user that wants to export a collection with 100000 documents now cannot if one document contains one invalid string, whereas the way Compass handled it before (just allow it because why not?) worked perfectly well.

Comment by Massimiliano Marcon [ 14/Jul/21 ]

The BSON spec says strings have to contain valid UTF-8 characters. So Compass (and the underlying driver) work as designed.

Generated at Wed Feb 07 22:37:55 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.