[SERVER-2939] Support Unicode fully in the Mongo shell (was "Linenoise UTF8 support")
Created: 12/Apr/11  Updated: 12/Jul/16  Resolved: 13/Jun/12

Status: Closed
Project: Core Server
Component/s: Shell
Affects Version/s: None
Fix Version/s: 2.1.2

Type: Improvement
Priority: Major - P3
Reporter: Mathias Stearn
Assignee: Tad Marshall
Resolution: Done
Votes: 4
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-272 Foreign characters in shell Closed
Related
is related to SERVER-6086 Unicode/UTF-8 in the shell needs to h... Closed

 Description   

Allow entry and display of Unicode characters and ensure correct handling of Unicode in all interactions with the server.



 Comments   
Comment by Tad Marshall [ 13/Jun/12 ]

The issue referenced above (Jun 04 2012 01:17:08 PM UTC), where console output in Windows wasn't handled correctly, is now fixed.

The remaining issues in the UTF-8/Unicode feature are:
1) Combining characters will not interact with cursor movement correctly. They will display correctly, but the cursor position will be offset. The combining characters need to be treated as "zero width" to do this properly.
2) Chinese, Japanese and Korean characters may display taking up two screen positions, but cursor positioning will not be correct. These characters need to be treated as "double width" to do this properly.

I'm going to resolve this ticket and file a new one for the remaining issues.
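
The two remaining issues come down to mapping each code point to a screen width of 0, 1 or 2 columns. A minimal sketch of that idea follows, assuming the line is already held as UTF-32 code points. The names approxCharWidth and cursorColumn are hypothetical, and the ranges are a small illustrative subset, not the full table in Markus Kuhn's wcwidth code.

```cpp
#include <cstddef>
#include <string>

// Simplified width lookup. The ranges below are an illustrative subset only,
// not a complete Unicode width table.
int approxCharWidth(char32_t cp) {
    if (cp >= 0x0300 && cp <= 0x036F) return 0;  // combining diacritical marks
    if ((cp >= 0x1100 && cp <= 0x115F) ||        // Hangul Jamo initial consonants
        (cp >= 0x3040 && cp <= 0x30FF) ||        // Hiragana and Katakana
        (cp >= 0x4E00 && cp <= 0x9FFF) ||        // CJK Unified Ideographs
        (cp >= 0xAC00 && cp <= 0xD7A3))          // Hangul syllables
        return 2;
    return 1;
}

// Screen columns occupied by the first `count` code points of `line`,
// i.e. where the cursor belongs when it sits at index `count`.
std::size_t cursorColumn(const std::u32string& line, std::size_t count) {
    std::size_t column = 0;
    for (std::size_t i = 0; i < count && i < line.size(); ++i)
        column += approxCharWidth(line[i]);
    return column;
}
```

With this mapping, "e" followed by U+0301 and "x" occupies two columns, while two CJK ideographs occupy four.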

Comment by auto [ 13/Jun/12 ]

Author:

{u'date': u'2012-06-13T02:03:21-07:00', u'email': u'tad@10gen.com', u'name': u'Tad Marshall'}

Message: SERVER-2939 fix Windows console output

For Windows, when writing to the console, convert text to UTF-16 and
write it to the screen using WriteConsoleW instead of fwrite or _write.
Branch: master
https://github.com/mongodb/mongo/commit/70546ba57409051eeef817304955a411f46b763b
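
As a rough illustration of the approach in this commit message (not the committed code), the sketch below encodes text as UTF-16 and, on Windows only, hands it to WriteConsoleW. It assumes the text has already been decoded to valid code points. The function names toUtf16 and writeToConsole are hypothetical.

```cpp
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Encode code points as UTF-16, using a surrogate pair above U+FFFF.
// Assumes every input value is a valid Unicode scalar value.
std::u16string toUtf16(const std::u32string& codePoints) {
    std::u16string out;
    for (char32_t cp : codePoints) {
        if (cp >= 0x10000) {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}

#ifdef _WIN32
// Write UTF-16 text straight to the console, bypassing the C runtime.
bool writeToConsole(const std::u16string& text) {
    HANDLE console = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written = 0;
    return WriteConsoleW(console, text.data(),
                         static_cast<DWORD>(text.size()), &written, NULL) != 0;
}
#endif
```

WriteConsoleW only works when the handle is a real console, so redirected output would still need a byte-oriented path.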

Comment by Tad Marshall [ 04/Jun/12 ]

The _write calls are still problematic in the Windows console. fwrite is unusable because it sends the first byte of a UTF-8 character to the console as a separate write, leading to corrupted display. But the _write call is also challenging because of two features and the way they interact:
1) It won't necessarily write everything you ask it to write in a single call, so you need to check the return value and call it again if some of your data has not yet been written;
2) It returns the number of characters written, not the number of bytes, so if you are sending it UTF-8 (multi-byte-per-character) data then you need to parse the UTF-8 data to figure out what the return value is telling you;
3) Because of the interaction of (1) and (2), it can split UTF-8 characters, causing display corruption.
This needs to be fixed for version 2.1.2. The likely fix is to change to using the Windows Console API instead of the C runtime functions.
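
To make points (1) and (3) concrete, here is a hedged sketch of a retry loop that only ever hands the output routine chunks ending on a UTF-8 character boundary. The Sink callback is a hypothetical stand-in for the real write call and returns the number of bytes it accepted; it does not model the character-count return value described in (2). The names chunkLength and writeAll are also hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
bool isContinuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Largest chunk length <= limit, starting at pos, that ends on a character
// boundary. Returns 0 if limit is too small to hold the next character.
std::size_t chunkLength(const std::string& text, std::size_t pos, std::size_t limit) {
    std::size_t end = pos + limit;
    if (end >= text.size()) return text.size() - pos;
    while (end > pos && isContinuation(text[end])) --end;
    return end - pos;
}

// Hypothetical sink: takes up to n bytes and returns how many it accepted.
using Sink = std::function<std::size_t(const char*, std::size_t)>;

// Keep calling the sink until the whole buffer has been written.
bool writeAll(const std::string& text, std::size_t maxChunk, const Sink& sink) {
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t len = chunkLength(text, pos, maxChunk);
        if (len == 0) return false;   // maxChunk cannot hold one whole character
        std::size_t done = sink(text.data() + pos, len);
        if (done == 0) return false;  // sink made no progress
        pos += done;                  // resume after whatever was accepted
    }
    return true;
}
```

A sink that accepts fewer bytes than offered can still stop mid-character, which is one reason a console API that takes UTF-16 is the more robust fix.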

Comment by auto [ 13/May/12 ]

Author:

{u'login': u'tadmarshall', u'name': u'Tad Marshall', u'email': u'tad@10gen.com'}

Message: SERVER-2939 Windows console _write may not write full buffer

Update my previous change to deal with calls to _write that
don't write the requested length. Loop until all characters
have been written. Affects long output strings from JavaScript.
Branch: master
https://github.com/mongodb/mongo/commit/4951e568672740b8bd783402afcb03dfd2db1d9c

Comment by Tad Marshall [ 22/Apr/12 ]

Most of this feature is in 2.1.1. The remaining part, to handle zero-width and double-width characters (for combining characters and wide CJK characters), can go into 2.1.2.

Comment by auto [ 16/Apr/12 ]

Author:

{u'login': u'tadmarshall', u'name': u'Tad Marshall', u'email': u'tad@10gen.com'}

Message: SERVER-2939 UTF-8 support for the shell

Major reworking of the internals of linenoise to support UTF-8. Added
Utf8String and Utf32String classes adapted from code by Mathias. Start
of work to handle zero-width and double-width characters (for combining
characters and Chinese-Japanese-Korean wide characters) using code from
Markus Kuhn (called mk_wcwidth as checked in here). Some additional
cleanup would be desirable, but all features should now work with Unicode
in Windows and non-Windows builds.
Branch: master
https://github.com/mongodb/mongo/commit/fc923dbad70755f8c98a8774299bd5061454a69d
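
The core of any UTF-8 to UTF-32 conversion is decoding each byte sequence into one code point. The sketch below shows that step in isolation. It is not the Utf8String or Utf32String code from this commit; decodeUtf8 is a hypothetical name, and the decoder skips checks a production version would need (overlong forms, surrogates, values above U+10FFFF).

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into code points. Malformed or truncated sequences
// become U+FFFD. Overlong encodings and surrogates are not rejected here.
std::u32string decodeUtf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        char32_t cp = 0;
        if (lead < 0x80)                { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray or invalid lead byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= bytes.size() ||
                (static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80) {
                ok = false;  // sequence ended early
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}
```

Once the line is an array of code points, cursor movement and deletion can work per character instead of per byte.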

Comment by auto [ 22/Mar/12 ]

Author:

{u'login': u'tadmarshall', u'name': u'Tad Marshall', u'email': u'tad@10gen.com'}

Message: SERVER-2939 Supporting code for UTF-8 in the shell

This commit lets the shell read UTF-8 from the command
line and fixes a display problem with lines that start
with a UTF-8 character. It does not include the actual
UTF-8 enabling in linenoise, but prepares for it.
Branch: master
https://github.com/mongodb/mongo/commit/3363b199e72a7b41c965fea9483d7825fd353d70

Comment by auto [ 22/Mar/12 ]

Author:

{u'login': u'tadmarshall', u'name': u'Tad Marshall', u'email': u'tad@10gen.com'}

Message: SERVER-2939 Add linenoise_utf8.cpp and linenoise_utf.h

New files for UTF-8 support in the shell.
Branch: master
https://github.com/mongodb/mongo/commit/95e4630a7ef89cc7030544984cc4ed5f09408e67

Comment by Tad Marshall [ 10/Feb/12 ]

Yes, the current handling of UTF-8 is very poor, and what you're seeing is the result of code that acts as if everything is ASCII. Backspacing over a three-byte UTF-8 character deletes only the third byte, leaving corrupt UTF-8. We know what we need to do to fix it; the problem is always competing priorities, where other stuff gets done first while this waits for attention. I would very much like to get this fixed in version 2.1.x (and hence fixed in 2.2), and as you can see it is scheduled for the next point release (2.1.1), so hopefully we'll have this working soon. If you could test the first version where we're claiming that this is fixed, which could be a nightly build, that would be great, but 2.0.2 and even 2.1.0 simply don't have code for doing this right. I'm shooting for getting this in before the end of February; since you are watching this Jira ticket, you should see activity when it happens.
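
For illustration, deleting a whole character instead of a single byte can be sketched as below, assuming the line buffer holds valid UTF-8. The name eraseLastChar is hypothetical and this is not the shell's actual code.

```cpp
#include <cstddef>
#include <string>

// Remove the final UTF-8 character from `line`, however many bytes it spans.
void eraseLastChar(std::string& line) {
    if (line.empty()) return;
    std::size_t pos = line.size() - 1;
    // Step back over continuation bytes (10xxxxxx) to reach the lead byte.
    while (pos > 0 && (static_cast<unsigned char>(line[pos]) & 0xC0) == 0x80)
        --pos;
    line.erase(pos);  // drop the lead byte and everything after it
}
```

Applied to a buffer ending in a three-byte character, this removes all three bytes, so no partial sequence is left behind.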

Comment by Jan Anderssen [ 10/Feb/12 ]

I have a possibly related observation: In mongo shell, when I enter a multibyte UTF-8 character and then try to delete it, what looks like a whitespace is inserted. The number of these frankenspaces is the same as the difference between byte and "symbol" count (so delete in a string with three ä's and you'll get three extra whitespaces). If I hit delete after entering "äöü" my mongo shell looks like this:

mongos> äü<space><space>_

where _ is the insert point now

After hitting delete multiple times, the insert point will "catch up" again, but with careless deleting/writing, I can also generate invalid UTF-8 sequences, e.g. by entering ä<delete>üö<delete> I'll get

mongos> <?>ü<space>_

where <?> is the diamond-shaped black-on-white question mark character.

Seems like delete deletes bytes not characters here.

(Sorry if this is duplicate or in the wrong place, just seemed to fit with "make the Mongo shell support Unicode properly for all input and output". I'm using MongoDB shell version: 2.0.2 in GNOME-Terminal 2.30.2 with character encoding set to UTF-8.)

Comment by auto [ 04/Jan/12 ]

Author:

{u'login': u'', u'name': u'Tad Marshall', u'email': u'tad@10gen.com'}

Message: SERVER-2939 add several static_cast<unsigned char>() casts

Prevent sign-extension of characters that have their high bit set when
passed to routines that take 'int'.
Branch: master
https://github.com/mongodb/mongo/commit/e27891645a2871a65d632862bc13e44fd7e83e31
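
A small sketch of the problem this commit addresses: where plain char is signed, a byte such as 0xE4 widens to a negative int, and passing a negative value other than EOF to the <cctype> functions is undefined behavior. The helper names below are hypothetical.

```cpp
#include <cctype>

// Widening a plain char directly: bytes >= 0x80 may become negative
// on platforms where char is signed.
int widenUnsafe(char c) { return c; }

// Casting through unsigned char first always yields a value in 0..255.
int widenSafe(char c) { return static_cast<unsigned char>(c); }

// Safe way to call a <cctype> classifier on a byte of UTF-8 text.
bool isSpaceByte(char c) { return std::isspace(widenSafe(c)) != 0; }
```

On a signed-char platform, widenUnsafe('\xE4') yields a negative value while widenSafe('\xE4') yields 0xE4.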

Comment by Tad Marshall [ 17/Nov/11 ]

I am interpreting this bug to be "make the Mongo shell support Unicode properly for all input and output", meaning keyboard and display for all supported operating systems. Internally, strings will be stored in UTF-8, but that isn't the actual "feature" from the point of view of a user. I will link duplicate bug reports to this ticket; this will be the "master" ticket for this feature.

Comment by Brandon Diamond [ 20/Oct/11 ]

Transferring back to Mathias who has already done some legwork on a patch.
