[SERVER-5333] Issues with non-ASCII characters in filenames and paths in Windows
Created: 19/Mar/12 | Updated: 28/Apr/17 | Resolved: 28/Apr/17
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Shell |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Tad Marshall | Assignee: | DO NOT USE - Backlog - Platform Team |
| Resolution: | Duplicate | Votes: | 4 |
| Labels: | Windows |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Tested in Windows 7 with code page 437 |
| Operating System: | Windows |
| Description |
|
If the path specified in --dbpath contains non-ASCII characters, the display of the path is wrong in Windows and the shell gets a "bad utf8" error when trying to display the response from "db.serverCmdLineOpts()". A customer on the mongodb-user group got errors in mongod.exe in his test, but he hasn't provided the path he used and I couldn't verify his problem with the simple path I picked. Using a subdirectory with the single-character name U+00E1:
The shell has problems displaying these options:
The customer's problem is described at http://groups.google.com/group/mongodb-user/browse_thread/thread/703eebc432e8f925 .
|
| Comments |
| Comment by Andrew Morrow (Inactive) [ 28/Apr/17 ] |
|
This was fixed in |
| Comment by Tad Marshall [ 06/Aug/13 ] |
|
The current line of approach for this issue is to call "boost::filesystem::path::imbue()" to tell boost::filesystem::path to internally convert UTF-8 characters to UTF-16 on Windows. The tricky part is whether we add boost::locale to our codebase or instead try to create a class that boost::filesystem::path can use in place of boost::locale. See http://www.boost.org/doc/libs/1_50_0/libs/locale/doc/html/default_encoding_under_windows.html for the starting point for this approach. |
| Comment by Tad Marshall [ 26/Jun/13 ] |
|
From looking through a lot of code, I don't think that this should be addressed by special-casing the places where the code opens files. The problem is that the code has already done existence checks and "type-of-file" (i.e. directory or normal file) checks before it reaches this point, so incorrect decisions have already been made. Once the scope is expanded to include existence checks, the amount of #ifdef _WIN32 littering and code refactoring would be a real problem. The approach I plan to use is to ever-so-slightly abstract boost::filesystem::path and boost::filesystem::exists so that they are called through a wrapper. This wrapper would have the same API as the corresponding boost::filesystem API but would internally convert the incoming UTF-8 path into a Windows UTF-16 path before using it (for Windows) and would be a pass-through for all other OSes. This will have some performance impact on Windows but hopefully no impact on other platforms. Since file path manipulation is not a large component of any actual database work, this should not be a problem in actual use. I'd like this wrapper pass to be no more extensive than it needs to be to solve this problem. A lot of code would be touched to swap in the wrapper functions, but the changes should be of the boilerplate variety and should not change code logic at all. |
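The wrapper idea in the comment above could look roughly like the following. This is only a sketch under stated assumptions, not the code that was eventually written: it uses C++17 std::filesystem in place of boost::filesystem so that it has no third-party dependency, and the namespace `utf8fs` and the helper `toNativePath` are hypothetical names chosen for illustration. MultiByteToWideChar is the standard Win32 conversion call.

```cpp
#include <cassert>
#include <filesystem>
#include <stdexcept>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

namespace utf8fs {  // hypothetical name, not taken from the MongoDB codebase

#ifdef _WIN32
// Windows: convert the UTF-8 path to UTF-16 so that the "wide" file APIs are used.
inline std::filesystem::path toNativePath(const std::string& utf8) {
    if (utf8.empty()) return std::filesystem::path();
    const int inLen = static_cast<int>(utf8.size());
    // First call: ask how many UTF-16 code units are needed.
    const int outLen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                           utf8.data(), inLen, nullptr, 0);
    if (outLen <= 0) throw std::invalid_argument("path is not valid UTF-8");
    std::wstring wide(static_cast<std::size_t>(outLen), L'\0');
    // Second call: perform the conversion into the sized buffer.
    const int written = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                            utf8.data(), inLen, &wide[0], outLen);
    assert(written == outLen);
    (void)written;
    return std::filesystem::path(wide);
}
#else
// Other platforms: pass the bytes through unchanged.
inline std::filesystem::path toNativePath(const std::string& utf8) {
    return std::filesystem::path(utf8);
}
#endif

// Same shape as the underlying filesystem calls, but the argument is always UTF-8.
inline bool exists(const std::string& utf8Path) {
    return std::filesystem::exists(toNativePath(utf8Path));
}

inline bool is_directory(const std::string& utf8Path) {
    return std::filesystem::is_directory(toNativePath(utf8Path));
}

}  // namespace utf8fs
```

Call sites would then change from the library call to, for example, `utf8fs::exists(dbpath)`, which keeps the #ifdef _WIN32 in one place and leaves the calling logic untouched.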
| Comment by Tad Marshall [ 11/Mar/13 ] |
|
This affects pretty much every bit of filename and path usage and display in Windows. |
| Comment by owen kao [ 25/Jun/12 ] |
|
In which version will this issue be fixed? It is an important issue for our project. Thanks. |
| Comment by Tad Marshall [ 03/Jun/12 ] |
|
|
| Comment by Tad Marshall [ 16/Apr/12 ] |
|
Part of the code for this is already written and just needs to be used by mongod.exe: getting the correct dbpath in the first place. The rest of the work is in translating the filespec to 16-bit Windows Unicode and calling the correct "wide" version of the file API. |
| Comment by Tad Marshall [ 19/Mar/12 ] |
|
Here's what I think happened to Anta when he used Chinese characters in his dbpath:
1) He entered text in code page 950, which displayed fine and which was provided to mongod.exe's "ANSI" main() routine encoded in Microsoft's version of Big5, a double-byte character set;
2) mongod.exe tried to use this text as if it were an ASCII or UTF-8 string to access the journal directory and bombed. The error message displayed the correct path to Anta because it was a legal string in his code page 950 character set.
In my test, the same thing happened except that the character I picked was from ISO Latin-1, so the 8-bit encoding and the Unicode character match and the dbpath worked fine. My problem came when the ISO Latin-1 (Unicode) character was interpreted as UTF-8 by the shell, resulting in the error message and in the character being displayed in the code page I was using for the shell (437).
All of this is fixable. All of the Windows mongodb programs should use wmain() instead of main() so they get proper Unicode (UTF-16) text, and they should convert every argument to UTF-8 before passing it to boost::program_options or storing it for later use. File open operations should convert the stored UTF-8 into UTF-16 and use the wide form of the Win32 API's CreateFile (etc.) functions. That's about it. If the text is stored internally in correctly translated UTF-8, the rest should just work. |
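To illustrate the "convert every argument to UTF-8" step, here is a portable sketch of a UTF-16 to UTF-8 conversion. It is an assumption-laden illustration rather than the project's code: the function names `appendUtf8`, `utf16ToUtf8` and `argsToUtf8` are hypothetical, and std::u16string stands in for the wchar_t strings that wmain() receives on Windows. On Windows itself, the Win32 call WideCharToMultiByte with CP_UTF8 performs the same conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Encode one Unicode code point as UTF-8 and append it to out.
inline void appendUtf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a UTF-16 string to UTF-8; throws on unpaired surrogates.
inline std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate: needs a low one next
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF) {
                throw std::invalid_argument("unpaired high surrogate");
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        appendUtf8(out, cp);
    }
    return out;
}

// Convert a whole argument vector, as a wmain()-style entry point might do
// before handing the arguments to option parsing.
inline std::vector<std::string> argsToUtf8(const std::vector<std::u16string>& wideArgs) {
    std::vector<std::string> result;
    result.reserve(wideArgs.size());
    for (const std::u16string& arg : wideArgs) {
        result.push_back(utf16ToUtf8(arg));
    }
    return result;
}
```

With this sketch, U+00E1 becomes the two bytes C3 A1, and the characters U+6E2C U+8A66 become the six bytes E6 B8 AC E8 A9 A6, which is the form that would be stored internally and converted back to UTF-16 only at the file API boundary.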
| Comment by Tad Marshall [ 19/Mar/12 ] |
|
Additional information provided by Anta Huang, the original poster in the mongodb-user group thread mentioned in the description, and also by Glenn Maynard in the same thread:
1) The character I typed (U+00E1) was being displayed as 'ß' because that is the character at position E1 in code page 437;
2) The non-ASCII text in Anta's example was '測試', which is the characters U+6E2C U+8A66. |
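The byte E1 is a useful test case because it reads differently under each encoding involved: it is U+00E1 in ISO Latin-1, 'ß' in code page 437, and in UTF-8 it is the lead byte of a three-byte sequence, so on its own it is malformed. The following is a minimal sketch of a UTF-8 validity check that shows why a consumer expecting UTF-8 would reject it. The function name `isValidUtf8` is hypothetical, and this is not the shell's actual check.

```cpp
#include <cassert>  // used by the usage checks described after the block
#include <cstddef>
#include <cstdint>
#include <string>

// Return true if s is well-formed UTF-8 (no truncated sequences, stray
// continuation bytes, overlong forms, surrogates, or values above U+10FFFF).
inline bool isValidUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        std::uint32_t cp = 0;   // decoded code point
        if (lead < 0x80) {
            cp = lead;
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            extra = 1;
            cp = lead & 0x1F;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            extra = 2;
            cp = lead & 0x0F;
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            extra = 3;
            cp = lead & 0x07;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
        if (s.size() - i <= extra) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (extra == 2 && cp < 0x800) return false;                       // overlong
        if (extra == 3 && (cp < 0x10000 || cp > 0x10FFFF)) return false;  // overlong or out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;                   // surrogate
        i += extra + 1;
    }
    return true;
}
```

For example, `assert(!isValidUtf8("\xE1"))` holds for the raw Latin-1 byte, while `assert(isValidUtf8("\xC3\xA1"))` holds for the UTF-8 encoding of the same character.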