[SERVER-55442] Server returns invalid utf-8 in duplicate key error message after truncating user input
Created: 23/Mar/21  Updated: 05/Apr/21  Resolved: 05/Apr/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.3, 4.9.0-alpha4
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Oleg Pudeyev (Inactive)
Assignee: David Storch
Resolution: Duplicate
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-24007 Server can return invalid UTF8 for er... Backlog
Problem/Incident
causes RUBY-2560 EncodingError raised when server retu... Backlog
Operating System: ALL
Steps To Reproduce:

Repro: https://github.com/p-mongo/tests/tree/master/ruby-2560

Sprint: Query Execution 2021-04-19
Participants:

Description

When a unique index is defined on a collection and an insert violates it, the server includes an excerpt of the duplicate key value in the error message.

When the inserted value contains multi-byte UTF-8 characters, the server appears to truncate the excerpt at a byte offset without regard for character boundaries, so the cut can fall in the middle of a multi-byte sequence. Once the truncated excerpt is incorporated into the error message, the message as a whole is no longer valid UTF-8.
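
A minimal sketch of the difference, in Ruby. The helper names `truncate_bytes` and `truncate_utf8` are hypothetical and are not the server's code (which this ticket does not show); they only illustrate cutting at a byte offset versus backing off to a character boundary.

```ruby
# Hypothetical helpers for illustration only; not taken from the server.

# Naive truncation: cut at a byte offset, ignoring character boundaries.
def truncate_bytes(str, max_bytes)
  str.byteslice(0, max_bytes)
end

# Boundary-aware truncation: drop trailing bytes until the prefix is valid UTF-8.
def truncate_utf8(str, max_bytes)
  cut = str.byteslice(0, max_bytes)
  cut = cut.byteslice(0, cut.bytesize - 1) until cut.valid_encoding?
  cut
end

sample = '(╯°□°)╯︵ ┻━┻'

# "(" is 1 byte and "╯" is 3 bytes, so a 3-byte limit splits "╯".
naive = truncate_bytes(sample, 3)
safe  = truncate_utf8(sample, 3)

puts naive.valid_encoding?
puts safe.valid_encoding?
puts safe
```

The boundary-aware variant may return fewer bytes than the limit allows, which seems an acceptable trade-off for an excerpt that is only meant to be read in an error message.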

Test code in Ruby:

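require 'mongo'

# Connect to the test server; with no database specified the driver uses "admin".
client = Mongo::Client.new(['localhost:14400'])

# Start from an empty collection with a unique index on k.
client['foo'].drop
client['foo'].indexes.create_one({k: 1}, unique: true)

# Multi-byte UTF-8 value, long enough for the server to truncate it in the error message.
rep = '(╯°□°)╯︵ ┻━┻(╯°□°)╯︵ ┻━┻'

client['foo'].insert_one(k: rep*10)
# The second insert violates the unique index and triggers the E11000 error.
client['foo'].insert_one(k: rep*10)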

The error message returned is:

E11000 duplicate key error collection: admin.foo index: k_1 dup key: { k: "(╯°□°)╯︵ ┻━┻(╯°□°)╯︵ ┻━┻(╯°□°)╯︵ ┻━┻(╯°□°)╯︵ ┻━┻(╯°□°)╯︵ ┻━┻(╯°□�..." }

The libbson UTF-8 validator that the Ruby driver uses rejects the message with the following error:

/home/w/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/bson-4.12.0/lib/bson/hash.rb:111:in `get_hash': String E11000 duplicate key error collection: admin.foo index: k_1 dup key: { k: "(╯°□°)╯︵ ┻━┻(╯°□°)╯︵ ┻━┻(╯°□°)╯︵ ┻━┻(╯°□°)╯︵ ┻━┻(╯°□°)╯︵ ┻━┻(╯°□�..." } is not valid UTF-8: bogus high bits for continuation byte (EncodingError)
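
The validator's complaint is consistent with a cut in the middle of a two-byte character. The sketch below reproduces that arithmetic under one loud assumption: a 150-byte limit on the excerpt, which is inferred from the message above and is not stated anywhere in this ticket. `ASSUMED_LIMIT` is a name made up for this example.

```ruby
flip = '(╯°□°)╯︵ ┻━┻'   # one half of `rep` from the test code
key  = flip * 20         # same value as rep * 10

# Assumption: inferred from the truncated output, not a documented server constant.
ASSUMED_LIMIT = 150

# Cut at the assumed byte limit, then again one byte earlier for comparison.
excerpt     = key.byteslice(0, ASSUMED_LIMIT)
whole_chars = excerpt.byteslice(0, ASSUMED_LIMIT - 1)

puts flip.bytesize                          # bytes in one flip
puts excerpt.valid_encoding?                # validity of the cut at the assumed limit
puts whole_chars.valid_encoding?            # validity one byte earlier
puts format('0x%02X', excerpt.bytes.last)   # the dangling final byte
```

If the assumption holds, the excerpt ends on the lead byte of `°`, and the `...` that the server appends places an ASCII `.` where a continuation byte is expected. That would match the "bogus high bits for continuation byte" wording, though the ticket itself does not confirm the limit.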

The error message is returned as a BSON string, which, according to my understanding of http://bsonspec.org/spec.html, must be valid UTF-8.

This was reported in https://jira.mongodb.org/browse/RUBY-2560. I verified against 2.6.12, 4.4.3 and 4.9.0-alpha5 servers.



Comments
Comment by David Storch [ 05/Apr/21 ]

Closing as a duplicate of SERVER-24007.

Comment by Kyle Suarez [ 30/Mar/21 ]

Assigning to david.storch to see if this ticket is a duplicate of another ticket. In any case, this may be a good candidate to nominate for the quick win bucket?

Comment by Oleg Pudeyev (Inactive) [ 23/Mar/21 ]

This issue is difficult to work around in the driver because the driver parses the entire response as a whole rather than handling the error message separately. In an environment that validates UTF-8 strings, the driver would have to repair invalid UTF-8 while parsing the entire response, which could return wrong data to applications. I elaborated on this in https://jira.mongodb.org/browse/RUBY-2560?focusedCommentId=3679006&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-3679006.
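
A rough illustration of that risk, using only Ruby's built-in `String#scrub`. The `response` hash and its field names are hypothetical stand-ins for a decoded server reply, not the driver's real data structures, and the sketch assumes a repair step that cannot tell the error message apart from other string fields.

```ruby
# Hypothetical decoded reply; both strings end in a stray lead byte (0xC3).
response = {
  'errmsg' => "dup key: { k: \"caf\xC3...\" }".b.force_encoding(Encoding::UTF_8),
  'data'   => "caf\xC3".b.force_encoding(Encoding::UTF_8),
}

# Blanket repair: replace every invalid byte with U+FFFD in every string field.
repaired = response.transform_values(&:scrub)

puts repaired['errmsg'].valid_encoding?
puts repaired['data'].valid_encoding?
# The application's field no longer holds the bytes it originally contained.
puts repaired['data'].bytes == response['data'].bytes
```

Repairing the error message is harmless, but the same pass silently rewrites the `data` field, so the application would receive a value different from the one in the reply.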
