[DRIVERS-2008] Default to lossy/replacement behavior when decoding UTF-8 in writeErrors Created: 15/Dec/21  Updated: 31/Mar/22

Status: Backlog
Project: Drivers
Component/s: None
Fix Version/s: None

Type: Improvement
Priority: Unknown
Reporter: Kaitlin Mahar
Assignee: Unassigned
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to RUBY-2560 EncodingError raised when server retu... Backlog
related to SERVER-24007 Server can return invalid UTF8 for er... Backlog
related to CDRIVER-2453 Invalid bson returned in bulk operati... Closed
related to RUST-886 Use Lossy UTF8 Decoding when decoding... Closed
related to NODE-3627 Getting "Invalid UTF-8 string in BSON... Closed
related to NODE-3670 Flexible BSON Validation Closed
related to RUST-648 Decoding a a document with lossy utf8... Closed
related to PYTHON-1090 Use 'replace' error handler when deco... Closed
related to DRIVERS-1936 Drivers should have option to disable... Backlog
Driver Changes: Needed

 Description   

Summary

Drivers should introduce workarounds for a longstanding server bug where invalid UTF-8 can be returned in the responses to write commands.

Motivation

There is a longstanding issue in the server where error messages can be truncated in the middle of a UTF-8 code point, resulting in the driver receiving invalid UTF-8 data (SERVER-24007). Users of a number of drivers have encountered this issue (see RUST-648, RUBY-2560, NODE-3627, CDRIVER-2453).

Some drivers have implemented workarounds to avoid erroring in these scenarios; for example, PYTHON-1090 and NODE-3670 changed those drivers to replace invalid UTF-8 sequences rather than raise an error when they appear in write command responses. Some drivers may already handle this situation gracefully.

While this is a server bug, driver users have been encountering it for a while and will continue to do so on older server versions even once it is fixed, so we should consider taking the same approach as Python and Node in all drivers.

To be specific, when decoding writeErrors in a server response, drivers should not error if invalid UTF-8 is encountered and should use lossy/replacement behavior instead.
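As a rough illustration of the difference between strict and lossy decoding, here is a minimal Python sketch. The message bytes and the helper name decode_errmsg are hypothetical and not taken from any driver; the sketch only assumes that the server cut an error message off after the first byte of a two-byte code point.

```python
def decode_errmsg(raw: bytes) -> str:
    """Decode an error message, replacing invalid UTF-8 with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


# Hypothetical errmsg: "é" encodes to two bytes (0xC3 0xA9); dropping the
# last byte simulates truncation in the middle of that code point.
truncated = 'dup key: { : "caf\u00e9'.encode("utf-8")[:-1]

# Strict decoding (Python's default) rejects the dangling 0xC3 byte.
try:
    truncated.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lossy decoding keeps the readable prefix and substitutes U+FFFD.
lossy = decode_errmsg(truncated)
```

With lossy decoding the user still sees the readable part of the server's message, at the cost of one replacement character where the truncation happened.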

Note that a couple of related DRIVERS tickets exist which cover slightly different subjects/cases where invalid UTF-8 can be encountered:

  • DRIVERS-1634 proposes drivers have uniform treatment when users provide data containing invalid UTF-8
  • DRIVERS-1936 proposes drivers should have an option to disable UTF-8 validation

Who is the affected end user?

Users of any driver that does not have a workaround for this issue in place.

How does this affect the end user?

They get confusing errors about invalid UTF-8 rather than a more helpful error message from the server.

How likely is it that this problem or use case will occur?

Fairly likely. A number of different ways to encounter it are documented in related tickets.

If the problem does occur, what are the consequences and how severe are they?

Users get cryptic error messages they are unable to debug.

Is this issue urgent?

Nothing is on fire, but we should consider addressing it sooner rather than later.

Is this ticket required by a downstream team?

At this time, no. This could come up in Compass and mongosh, but the Node team has already released their workaround.

Is this ticket only for tests?

No, there is a functional change proposed as well.



 Comments   
Comment by Kaitlin Mahar [ 23/Mar/22 ]

Digging through all of the related tickets to this, as far as I can tell this problem has only ever been observed specifically with duplicate key errors. So it may be sufficient for drivers to apply this lossy decoding logic only for responses to insert and update commands.
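One way this narrower scoping might look, as a sketch only: the set LOSSY_COMMANDS and the function decode_reply_string are hypothetical names, and the choice of commands simply mirrors the insert and update commands mentioned above.

```python
# Hypothetical: commands whose replies are decoded lossily, per the
# suggestion above. Not taken from any driver's implementation.
LOSSY_COMMANDS = frozenset({"insert", "update"})


def decode_reply_string(raw: bytes, command_name: str) -> str:
    """Decode a reply string, lossily only for the listed commands."""
    errors = "replace" if command_name in LOSSY_COMMANDS else "strict"
    return raw.decode("utf-8", errors=errors)
```

Replies to all other commands would keep strict validation under this scheme, so invalid UTF-8 elsewhere would still surface as an error.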
