[SERVER-2939] Support Unicode fully in the Mongo shell (was "Linenoise UTF8 support") Created: 12/Apr/11 Updated: 12/Jul/16 Resolved: 13/Jun/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Shell |
| Affects Version/s: | None |
| Fix Version/s: | 2.1.2 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Mathias Stearn | Assignee: | Tad Marshall |
| Resolution: | Done | Votes: | 4 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
Allow entry and display of Unicode characters and ensure correct handling of Unicode in all interactions with the server. |
| Comments |
| Comment by Tad Marshall [ 13/Jun/12 ] |
|
The issue referenced above (Jun 04 2012 01:17:08 PM UTC), where console output in Windows wasn't handled correctly, is now fixed. The remaining issues in the UTF-8/Unicode feature are: I'm going to resolve this ticket and file a new one for the remaining issues. |
| Comment by auto [ 13/Jun/12 ] |
|
Author: Tad Marshall (tad@10gen.com), 2012-06-13T02:03:21-07:00. Message: For Windows, when writing to the console, convert text to UTF-16 and |
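The commit message above is cut off, so what the shell does with the converted text is not shown here. As a rough illustration of the conversion step alone, the sketch below turns UTF-8 bytes into UTF-16 code units. The helper name `utf8_to_utf16_units` is hypothetical and is not taken from the MongoDB source.

```python
from typing import List


def utf8_to_utf16_units(data: bytes) -> List[int]:
    """Decode UTF-8 bytes and return the equivalent UTF-16 code units.

    Hypothetical helper for illustration only.
    """
    text = data.decode("utf-8")
    raw = text.encode("utf-16-le")  # little-endian, no byte order mark
    # Each UTF-16 code unit is two bytes wide.
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


# A two-byte UTF-8 character becomes a single UTF-16 code unit.
print([hex(u) for u in utf8_to_utf16_units("ä".encode("utf-8"))])
# A character outside the Basic Multilingual Plane becomes a surrogate pair.
print([hex(u) for u in utf8_to_utf16_units("\U0001F600".encode("utf-8"))])
```

Characters outside the Basic Multilingual Plane take two code units, so a count of UTF-16 units is not a count of characters either.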
| Comment by Tad Marshall [ 04/Jun/12 ] |
|
The _write calls are still problematic in the Windows console. fwrite is unusable because it sends the first byte of a UTF-8 character to the console as a single write, which corrupts the display. But the _write call is also challenging because of two features: |
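The fwrite problem described in this comment, where a write boundary falls inside a multibyte character, can be modeled without a Windows console. The sketch below assumes a consumer that decodes each write on its own, which is an assumption about the failure mode and not a description of the console's internals. It compares that consumer with an incremental decoder that holds incomplete sequences until the rest arrives. Both function names are hypothetical.

```python
import codecs
from typing import Iterable


def decode_independently(chunks: Iterable[bytes]) -> str:
    """Treat every chunk as complete text, as a naive consumer might."""
    return "".join(c.decode("utf-8", errors="replace") for c in chunks)


def decode_buffered(chunks: Iterable[bytes]) -> str:
    """Carry incomplete UTF-8 sequences over to the next chunk."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    return "".join(decoder.decode(c) for c in chunks)


encoded = "ä".encode("utf-8")         # two bytes: 0xC3 0xA4
chunks = [encoded[:1], encoded[1:]]   # write boundary inside the character

print(repr(decode_independently(chunks)))  # replacement characters
print(repr(decode_buffered(chunks)))       # the original character
```

Either writing whole characters at a time or buffering on the receiving side avoids the corruption. The comment suggests the shell had control only over the writing side.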
| Comment by auto [ 13/May/12 ] |
|
Author: Tad Marshall (login: tadmarshall, email: tad@10gen.com). Message: Update my previous change to deal with calls to _write that |
| Comment by Tad Marshall [ 22/Apr/12 ] |
|
Most of this feature is in 2.1.1. The remaining part, handling zero-width and double-width characters (combining characters and wide CJK characters), can go into 2.1.2. |
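To illustrate why zero-width and double-width characters need separate handling, here is a minimal sketch that estimates how many terminal columns a string occupies. It uses Python's `unicodedata` module and is only an approximation: it ignores control characters and other zero-width format characters, and it is not how linenoise or the shell computes widths. The function name `display_width` is hypothetical.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate the number of terminal columns needed to show text."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn over the previous character.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            # Wide and fullwidth characters take two columns.
            width += 2
        else:
            width += 1
    return width


print(display_width("abc"))              # plain ASCII
print(display_width("e\u0301"))          # "e" plus a combining acute accent
print(display_width("\u65e5\u672c\u8a9e"))  # three wide CJK characters
```

A line editor that assumes one column per character will misplace the cursor for the second and third strings.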
| Comment by auto [ 16/Apr/12 ] |
|
Author: Tad Marshall (login: tadmarshall, email: tad@10gen.com). Message: Major reworking of the internals of linenoise to support UTF-8. Added |
| Comment by auto [ 22/Mar/12 ] |
|
Author: Tad Marshall (login: tadmarshall, email: tad@10gen.com). Message: This commit lets the shell read UTF-8 from the command |
| Comment by auto [ 22/Mar/12 ] |
|
Author: Tad Marshall (login: tadmarshall, email: tad@10gen.com). Message: New files for UTF-8 support in the shell. |
| Comment by Tad Marshall [ 10/Feb/12 ] |
|
Yes, the current handling of UTF-8 is very poor, and what you're seeing is the result of code that acts as if everything is ASCII. Backspacing over a three-byte UTF-8 character will delete only the third byte, leaving corrupt UTF-8. We know what we need to do to fix it; the problem is always competing priorities: other stuff gets done first while this waits for attention. I would very much like to get this fixed in version 2.1.x (and hence fixed in 2.2), and as you can see it is scheduled for the next point release (2.1.1), so hopefully we'll have this working soon. If you could test the first version where we're claiming that this is fixed, which could be a nightly build, that would be great, but 2.0.2 and even 2.1.0 simply don't have code for doing this right. I'm shooting for getting this in before the end of February; since you are watching this Jira ticket, you should see activity when it happens. |
| Comment by Jan Anderssen [ 10/Feb/12 ] |
|
I have a possibly related observation: in the mongo shell, when I enter a multibyte UTF-8 character and then try to delete it, what looks like a whitespace is inserted. The number of these frankenspaces is the same as the difference between byte and "symbol" count (so delete in a string with three ä's and you'll get three extra whitespaces). If I hit delete after entering "äöü", my mongo shell looks like this:
mongos> äü<space><space>_
where _ is the insert point now.
After hitting delete multiple times, the insert point will "catch up" again, but with careless deleting/writing I can also generate invalid UTF-8 sequences, e.g. by entering ä<delete>üö<delete> I'll get
mongos> <?>ü<space>_
where <?> is the diamond-shaped black-on-white question mark character. It seems that delete removes bytes, not characters, here. (Sorry if this is a duplicate or in the wrong place; it just seemed to fit with "make the Mongo shell support Unicode properly for all input and output". I'm using MongoDB shell version 2.0.2 in GNOME-Terminal 2.30.2 with character encoding set to UTF-8.) |
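The byte-versus-character behaviour reported here can be sketched in a few lines. The example below compares a backspace that drops one byte with one that drops a whole UTF-8 code point by skipping back over continuation bytes (those of the form 10xxxxxx). Both function names are hypothetical, and this is a general technique, not the fix that was committed for this ticket.

```python
def delete_last_byte(buf: bytes) -> bytes:
    """ASCII-style backspace: remove exactly one byte."""
    return buf[:-1]


def delete_last_code_point(buf: bytes) -> bytes:
    """UTF-8 aware backspace: remove one whole code point."""
    end = len(buf)
    if end == 0:
        return buf
    end -= 1
    # Continuation bytes match 0b10xxxxxx; step back to the lead byte.
    while end > 0 and (buf[end] & 0xC0) == 0x80:
        end -= 1
    return buf[:end]


line = "äöü".encode("utf-8")
# Dropping one byte leaves a dangling lead byte, shown as U+FFFD.
print(delete_last_byte(line).decode("utf-8", errors="replace"))
# Dropping one code point leaves valid UTF-8.
print(delete_last_code_point(line).decode("utf-8"))
```

The same backward scan gives the number of bytes to remove, which the editor also needs in order to redraw the line correctly.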
| Comment by auto [ 04/Jan/12 ] |
|
Author: Tad Marshall (tad@10gen.com). Message: Prevent sign-extension of characters that have their high bit set when |
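The commit message above is truncated, so the exact code path it changed is not known from this page. The general hazard is that in C or C++, on platforms where `char` is signed, a byte with its high bit set becomes a negative value when widened to `int`. The sketch below models that arithmetic in Python. Both function names are hypothetical, and the 32-bit width is an assumption.

```python
def widen_signed(byte_value: int) -> int:
    """Model a signed 8-bit char widened to a 32-bit int (sign-extended)."""
    signed = byte_value - 0x100 if byte_value & 0x80 else byte_value
    # Show the two's complement bit pattern in 32 bits.
    return signed & 0xFFFFFFFF


def widen_unsigned(byte_value: int) -> int:
    """Model the same byte converted to unsigned char before widening."""
    return byte_value & 0xFF


lead = "ä".encode("utf-8")[0]       # 0xC3, the lead byte of a 2-byte sequence
print(hex(widen_signed(lead)))      # high bits filled with ones
print(hex(widen_unsigned(lead)))    # value preserved
```

ASCII bytes are unaffected by this, which is why code that was only tested with ASCII input can hide the problem.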
| Comment by Tad Marshall [ 17/Nov/11 ] |
|
I am interpreting this bug to be "make the Mongo shell support Unicode properly for all input and output", meaning keyboard and display for all supported operating systems. Internally, strings will be stored in UTF-8, but that isn't the actual "feature" from the point of view of a user. I will link duplicate bug reports to this ticket; this will be the "master" ticket for this feature. |
| Comment by Brandon Diamond [ 20/Oct/11 ] |
|
Transferring back to Mathias who has already done some legwork on a patch. |