[SERVER-65926] Display string truncation not respecting UTF-8 character boundaries Created: 25/Apr/22  Updated: 27/Oct/23  Resolved: 18/May/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 5.3.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Justin Casali Assignee: Chris Kelly
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

See SERVER-11873

Participants:

 Description   

In our logs long strings are being truncated to 150 bytes irrespective of UTF-8 character boundaries. While this is not an issue for ASCII characters any multi-byte character that is on the truncation boundary gets cut at exactly 150 bytes and produces an invalid UTF-8 byte sequence contaminating our log file with an improper encoding.

For example 'あ' is "\xE3\x81\x82" in UTF-8 bytes but if the last byte is cut it produces the invalid UTF-8 byte sequence of "\xE3\x81".

The code causing this issue can be found here. It should be updated to be aware of character boundaries in its trimming logic.



 Comments   
Comment by Chris Kelly [ 18/May/22 ]

Hi Justin,

Thanks for the clarification! This will help for future reference.
I'll go ahead and close this ticket for now.

Regards,
Christopher

Comment by Justin Casali [ 17/May/22 ]

Hi Christopher,

Apologies for the late response, was out all last week.

Looking back the truncation errors we were having are in version 4.2 which is before the new json structured logging was added in 4.4. The underlying code that was causing the issue is still present in the latest version although it looks like from your tests the truncation logic has switched to not displaying strings at all instead of partial chopped ones. 

Given that this issue does not appear in newer versions and we managed to implement a fix on our end in 4.2 I think we can close this bug report.

Appreciate the help,
Justin

Comment by Chris Kelly [ 17/May/22 ]

Hi Justin,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please send your mongod logs to the link provided above?

Regards,
Christopher

Comment by Chris Kelly [ 06/May/22 ]

Hi Justin,

I went ahead and tried reproducing your problem on a single mongod running 5.3.1 and did not observe this behavior with the reproduction you refer to in SERVER-11873. On 5.3.1, the output looks like the following:

{"t":\{"$date":"2022-05-06T13:13:14.794-04:00"},"s":"I",  "c":"WRITE",    "id":51803,   "ctx":"conn7","msg":"Slow query","attr":\{"type":"update","ns":"test.test","appName":"MongoDB Shell","command":{"q":{},"u":\{"$set":{"description":"Anupam Roy is a Bengali lyricist, composer, singer from Kolkata, West Bengal, India. He is best known for his song Amake amar moto thakte dao (আমাকে আমার মত থাকতে দাও), which appeared on the soundtrack to the 2010 film Autograph"}}

The issue does, however, show itself on version 2.4.8 which that ticket refers to.

I also tried decreasing the maxLogSizeKB to 1 and was able to keep adding characters to the string with no truncation observed, up until that limit where it would no longer display the string whatsoever.

In order to look into this further, it'd be helpful to see your mongod log and the $dbpath/diagnostic.data directory (the contents are described here). I've created a secure upload portal for you to upload an example covering the time you see this. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.

Additionally, if there is any other specific information you can provide on reproducing your specific problem in the logs, or something I may have omitted, that would be helpful before we pass it on.

Regards,
Christopher

Generated at Thu Feb 08 06:04:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.