[SERVER-5099] Non-ASCII text on the command line isn't handled well in Windows Created: 26/Feb/12 Updated: 11/Jul/16 Resolved: 16/Mar/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 2.5.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Tad Marshall | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 1 |
| Labels: | Windows | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Windows command line with text that isn't completely US-ASCII |
||
| Operating System: | Windows |
| Participants: | |||||||||||||||||
| Description |
|
Text characters above 0x7F entered on the command line for mongod.exe, mongos.exe, mongo.exe and the other programs in the suite are not always handled correctly on Windows. Although we build the Windows versions with UNICODE and _UNICODE defined, the entry point we declare is main(), which gives us text in the 8-bit code page of the invoking command window. To get a wide-character UTF-16 string we would need to change the entry point to wmain(), which would in turn require using a wide version of boost::program_options to parse the 16-bit characters.

The misbehavior seen depends on the code page of the invoking command window. In US English versions of Windows, you get the DOS-compatible code page 437 unless you have changed your configuration. In Western European versions of Windows you may get code page 1252, which matches ISO Latin 1 (and therefore Unicode) for characters 0xA0 through 0xFF but differs from it in the 0x80 to 0x9F range.

Beyond these issues, there may be places where the data itself isn't handled correctly: I found a few in the Windows Service code and am fixing them. We were getting sign-extension of characters between 0x80 and 0xFF, which turned 0xE1 ("LATIN SMALL LETTER A WITH ACUTE", 'á') into U+FFE1 ("FULLWIDTH POUND SIGN", '￡').

This may not be an issue for some users (US-only, or European/UK users using code page 1252), but it is likely to pop up repeatedly until we make the code fully Unicode-capable. |
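The sign-extension problem described above can be reproduced in a few lines. This is only a minimal sketch, not the Windows Service code itself: `widenBuggy` and `widenFixed` are hypothetical names, `signed char` stands in for a `char` that is signed (the MSVC default), and `char16_t` stands in for the 16-bit Windows `wchar_t`.

```cpp
// Hypothetical illustration of the sign-extension bug; not taken from the
// MongoDB source. signed char models a signed 8-bit char and char16_t
// models a 16-bit wide character.

// Buggy widening: the negative 8-bit value is converted directly, so the
// high byte of the result is filled with one bits (0xE1 becomes 0xFFE1).
constexpr char16_t widenBuggy(signed char c) {
    return static_cast<char16_t>(c);
}

// Fixed widening: go through unsigned char first so the value is
// zero-extended (0xE1 becomes 0x00E1).
constexpr char16_t widenFixed(signed char c) {
    return static_cast<char16_t>(static_cast<unsigned char>(c));
}
```

Calling `widenBuggy(static_cast<signed char>(0xE1))` yields 0xFFE1, the code point reported above, while `widenFixed` yields 0x00E1. Note that zero-extension only gives the right Unicode code point when the byte's code page agrees with Unicode at that position; in general the 8-bit text still has to be converted from its actual code page.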
| Comments |
| Comment by auto [ 16/Mar/13 ] |
|
Author: Tad Marshall <tad@10gen.com> (2013-03-16T16:51:04Z) Message: For the Windows version of mongobridge, bsondump, docgenerator, mongodump, |
| Comment by auto [ 15/Mar/13 ] |
|
Author: Tad Marshall <tad@10gen.com> (2013-03-15T06:55:57Z) Message: For the Windows version of test, switch to a Unicode "wmain()" entry |
| Comment by auto [ 15/Mar/13 ] |
|
Author: Tad Marshall <tad@10gen.com> (2013-03-14T20:58:56Z) Message: For the Windows version of mongos, switch to a Unicode "wmain()" entry |
| Comment by auto [ 14/Mar/13 ] |
|
Author: Tad Marshall <tad@10gen.com> (2013-03-13T19:13:04Z) Message: For the Windows version of mongod, switch to a Unicode "wmain()" entry |
| Comment by auto [ 13/Mar/13 ] |
|
Author: Tad Marshall <tad@10gen.com> (2013-03-13T16:52:49Z) Message: |
| Comment by auto [ 13/Mar/13 ] |
|
Author: Tad Marshall <tad@10gen.com> (2013-03-13T14:21:56Z) Message: Implement support for the third parameter (environment) of main()/wmain() |
| Comment by Tad Marshall [ 21/Mar/12 ] |
|
I have made this change to mongo.exe as part of making UTF-8 work right in the shell ( |
| Comment by Tad Marshall [ 26/Feb/12 ] |
|
A small simplification to what I described in the problem statement would be to change the Windows executables to start at wmain() to get the wide-character Unicode text from Windows, and then convert it to UTF-8 before processing it. We could do the whole command line and then parse it into argc and argv, or we could convert the argv components one at a time. This would let us stay with the 8-bit-character version of boost::program_options and make the processing code more similar between Windows and non-Windows versions. Since we want Unicode to work correctly on Windows, we would then just need to translate from UTF-8 into Windows-style UTF-16 wide characters before using the text in any Windows API. I don't know about boost::file_operation (what happens when you pass a UTF-8 encoded string to the Windows version) but it might "just work" ... we'll need a bunch of manual testing to see what we can get away with. |
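The "convert the argv components one at a time" option could look roughly like the sketch below. It is an assumption-laden illustration rather than the change that was committed: `appendUtf8`, `utf16ToUtf8` and `convertArgs` are hypothetical names, and the conversion is written out by hand over `std::u16string` so the sketch stays portable. On Windows the same step would more typically call `WideCharToMultiByte` with `CP_UTF8` on the `wchar_t*` arguments received by wmain().

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: append one Unicode code point to out as UTF-8.
static void appendUtf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert one UTF-16 argument to UTF-8.
// Surrogate pairs are combined; an unpaired surrogate becomes U+FFFD.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
        const bool isLow = cp >= 0xDC00 && cp <= 0xDFFF;
        if (isHigh && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // consume the low surrogate as well
        } else if (isHigh || isLow) {
            cp = 0xFFFD;  // replacement character for a lone surrogate
        }
        appendUtf8(out, cp);
    }
    return out;
}

// Convert each wide argument separately, so the existing 8-bit
// option parsing can run on the UTF-8 results unchanged.
std::vector<std::string> convertArgs(const std::vector<std::u16string>& wideArgs) {
    std::vector<std::string> utf8Args;
    utf8Args.reserve(wideArgs.size());
    for (const std::u16string& arg : wideArgs) {
        utf8Args.push_back(utf16ToUtf8(arg));
    }
    return utf8Args;
}
```

Converting per argument keeps argc unchanged and avoids re-implementing the Windows command-line quoting rules, which converting the whole command line and re-parsing it would require.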