Type: Improvement
Resolution: Unresolved
Priority: Unknown
Component/s: None
Summary
Drivers should introduce workarounds for a longstanding server bug where invalid UTF-8 can be returned in the responses to write commands.
Motivation
There is a longstanding issue in the server where error messages can be truncated in the middle of a UTF-8 code point, resulting in the driver receiving invalid UTF-8 data (SERVER-24007). Users of a number of drivers have encountered this issue (see RUST-648, RUBY-2560, NODE-3627, CDRIVER-2453).
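As a rough illustration of the failure mode, the following Python sketch shows how cutting a byte string in the middle of a multi-byte code point leaves data that a strict UTF-8 decoder rejects. The message text and the cut position are invented for the example; the server's actual truncation logic is not reproduced here.

```python
# Hypothetical error message ending in a 2-byte character (U+00E9, "e" with acute).
message = "duplicate key: caf\u00e9"
encoded = message.encode("utf-8")

# Drop the final byte, which splits the 2-byte sequence in half.
truncated = encoded[:-1]

# A strict decoder refuses the dangling lead byte.
try:
    truncated.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```

With strict decoding the driver surfaces this decoding failure instead of the server's original error message.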
Some drivers have implemented workarounds for this issue to avoid erroring in these scenarios; for example PYTHON-1090 and NODE-3670 switch the drivers to replace invalid Unicode characters rather than erroring when encountering them in write command responses. Some drivers may already automatically handle this situation gracefully.
While this is a server bug, driver users have been encountering it for a while and will continue to do so on older server versions even once it is fixed, so we should consider taking a similar approach to what Python and Node have done in all drivers.
To be specific, when decoding writeErrors in a server response, drivers should not error if invalid UTF-8 is encountered and should use lossy/replacement behavior instead.
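A minimal sketch of the lossy/replacement behavior, using Python's built-in `replace` error handler (the handler named in the PYTHON-1090 title). The helper name `decode_error_message` and the sample bytes are hypothetical and are not taken from any driver's code.

```python
def decode_error_message(raw: bytes) -> str:
    """Decode a server error message leniently.

    Invalid UTF-8 sequences become U+FFFD (the replacement character)
    instead of raising UnicodeDecodeError.
    """
    return raw.decode("utf-8", errors="replace")


# Hypothetical message truncated in the middle of a 2-byte code point.
raw = "E11000 duplicate key: caf\u00e9".encode("utf-8")[:-1]
decoded = decode_error_message(raw)
```

The readable part of the message survives, and only the damaged trailing bytes are replaced. Valid input decodes exactly as it would under strict decoding.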
Note that a couple of related DRIVERS tickets exist which cover slightly different subjects/cases where invalid UTF-8 can be encountered:
- DRIVERS-1634 proposes drivers have uniform treatment when users provide data containing invalid UTF-8
- DRIVERS-1936 proposes drivers should have an option to disable UTF-8 validation
Who is the affected end user?
Users of any driver that does not have a workaround for this issue in place.
How does this affect the end user?
They get confusing errors about invalid UTF-8 rather than a more helpful error message from the server.
How likely is it that this problem or use case will occur?
Fairly likely. A number of different ways to encounter it are documented in related tickets.
If the problem does occur, what are the consequences and how severe are they?
Users get cryptic error messages they are unable to debug.
Is this issue urgent?
Nothing is on fire, but we should consider addressing it sooner rather than later.
Is this ticket required by a downstream team?
At this time, no. This could come up in Compass and mongosh, but the Node team has already released their workaround.
Is this ticket only for tests?
No, there is a functional change proposed as well.
is related to
- JAVA-5575 Java Driver allows inserting invalid UTF-8 as string values (Closed)
- SERVER-93732 The server should reject inserting/updating strings containing invalid UTF-8 (Needs Scheduling)
related to
- RUBY-2560 EncodingError raised when server returns invalid UTF-8 in error messages derived from user input (Backlog)
- SERVER-24007 Server can return invalid UTF8 for error messages due to truncation in the middle of a code point (Backlog)
- CDRIVER-2453 Invalid bson returned in bulk operation reply in some cases (Closed)
- RUST-886 Use Lossy UTF8 Decoding when decoding writeErrors returned from the server (Closed)
- NODE-3627 Getting "Invalid UTF-8 string in BSON document" instead on unique constraint error on bulkWrite.replaceOne (Closed)
- NODE-3670 Flexible BSON Validation (Development Complete)
- RUST-648 Decoding a a document with lossy utf8 conversion #226 (Closed)
- PYTHON-1090 Use 'replace' error handler when decoding write responses (Closed)
- DRIVERS-1936 Drivers should have option to disable UTF-8 validation for BSON strings (Backlog)