[SERVER-5333] Issues with non-ASCII characters in filenames and paths in Windows
Created: 19/Mar/12  Updated: 28/Apr/17  Resolved: 28/Apr/17

Status: Closed
Project: Core Server
Component/s: Shell
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Tad Marshall
Assignee: DO NOT USE - Backlog - Platform Team
Resolution: Duplicate
Votes: 4
Labels: Windows
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Tested in Windows 7 with code page 437


Issue Links:
Depends
depends on SERVER-5099 Non-ASCII text on the command line is... Closed
Duplicate
duplicates SERVER-16725 Incorrect character conversion betwee... Closed
Related
related to SERVER-7496 Mongo.exe client crashes when usernam... Closed
related to DOCS-2007 tricks: "manual/reference/configurati... Closed
Operating System: Windows
Participants:

 Description   

If the path specified in --dbpath contains non-ASCII characters, the display of the path is wrong in Windows and the shell gets a "bad utf8" error when trying to display the response from "db.serverCmdLineOpts()". A customer on the mongodb-user group got errors in mongod.exe in his test, but he hasn't provided the path he used and I couldn't verify his problem with the simple path I picked.

Using a subdirectory with the single-character name U+00E1:

mongod --dbpath c:\data\á
// ... snip ...
Mon Mar 19 08:42:28 [initandlisten] options: { dbpath: "c:\data\ß" }
// ... server is now running ...

The shell has problems displaying these options:

> db.serverCmdLineOpts()
Mon Mar 19 08:51:00 decode failed. probably invalid utf-8 string [c:\data\ß]
Mon Mar 19 08:51:00      why: InternalError: buffer too small
Mon Mar 19 08:51:00 Error: invalid utf8 src/mongo/shell/utils.js:1010

The customer's problem is described at http://groups.google.com/group/mongodb-user/browse_thread/thread/703eebc432e8f925 .

mongod.exe --dbpath "\Users\test\NON-ASCII\data\db" --port 27017 --logappend --logpath "\Users\test\NON-ASCII\logs\mongod.log" --rest --vvvvv 
 
and the log shows 
 
Mon Mar 19 16:59:47 [initandlisten] User Assertion: 13518:couldn't open file /Users/test/NON-ASCII/data/db/journal/tempLatencyTest for writing errno:3 
Mon Mar 19 16:59:47 [initandlisten] info preallocateIsFaster couldn't run; returning false 
Mon Mar 19 16:59:47 [initandlisten] User Assertion: 13518:couldn't open file /Users/test/NON-ASCII/data/db/journal/j._0 for writing errno: 3 
Mon Mar 19 16:59:47 [initandlisten] exception in initAndListen: 13518 couldn't open file /Users/test/NON-ASCII/data/db/journal/j._0 for writing errno:3 
terminating 
 
and if I replace the path with ASCII words, it works fine



 Comments   
Comment by Andrew Morrow (Inactive) [ 28/Apr/17 ]

This was fixed in SERVER-16725, closing as a duplicate.

Comment by Tad Marshall [ 06/Aug/13 ]

The current line of approach for this issue is to call "boost::filesystem::path::imbue()" to tell boost::filesystem::path to internally convert UTF-8 characters to UTF-16 on Windows. The tricky part is whether we add boost::locale to our codebase or instead try to create a class that boost::filesystem::path can use in place of boost::locale.

See http://www.boost.org/doc/libs/1_50_0/libs/locale/doc/html/default_encoding_under_windows.html for the starting point for this approach.

Comment by Tad Marshall [ 26/Jun/13 ]

From looking through a lot of code, I don't think that this should be addressed by special-casing the places where the code opens files. The problem is that the code has already done existence checks and "type-of-file" (i.e. directory or normal file) checks before it reaches this point, so incorrect decisions have already been made. Once the scope is expanded to include existence checks, the amount of #ifdef _WIN32 littering and code refactoring would be a real problem.

The approach I plan to use is to ever-so-slightly abstract boost::filesystem::path and boost::filesystem::exists so that they are called through a wrapper. This wrapper would have the same API as the corresponding boost::filesystem API but would internally convert the incoming UTF-8 path into a Windows UTF-16 path before using it (for Windows) and would be a pass-through for all other OSes.

This will have some performance impact on Windows but hopefully no impact on other platforms. Since file path manipulation is not a large component of any actual database work, this should not be a problem in actual use.

I'd like this wrapper pass to be no more extensive than it needs to be to solve this problem.

A lot of code would be touched to swap in the wrapper functions, but the changes should be of the boilerplate variety and should not change code logic at all.
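
A minimal sketch of what such a wrapper could look like, assuming a simplified string-based interface rather than the full boost::filesystem::path API that the comment describes. The namespace utf8fs and the functions toNative, exists, and isDirectory are hypothetical names chosen for this illustration; they do not come from the MongoDB source. On Windows the UTF-8 path is converted to UTF-16 with MultiByteToWideChar and checked with GetFileAttributesW; on other platforms the path is passed straight through to stat().

```cpp
#include <string>

#ifdef _WIN32
#include <windows.h>
#else
#include <sys/stat.h>
#endif

namespace utf8fs {  // hypothetical namespace, for illustration only

#ifdef _WIN32
// Convert a UTF-8 path to UTF-16. Returns an empty string if the input
// is empty or is not valid UTF-8.
std::wstring toNative(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    const int len = static_cast<int>(utf8.size());
    const int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      utf8.data(), len, nullptr, 0);
    if (n <= 0) return std::wstring();
    std::wstring wide(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), len, &wide[0], n);
    return wide;
}

// Existence check through the wide ("W") API.
bool exists(const std::string& utf8Path) {
    const std::wstring wide = toNative(utf8Path);
    return !wide.empty() &&
           GetFileAttributesW(wide.c_str()) != INVALID_FILE_ATTRIBUTES;
}

// "Type-of-file" check through the wide ("W") API.
bool isDirectory(const std::string& utf8Path) {
    const std::wstring wide = toNative(utf8Path);
    if (wide.empty()) return false;
    const DWORD attrs = GetFileAttributesW(wide.c_str());
    return attrs != INVALID_FILE_ATTRIBUTES &&
           (attrs & FILE_ATTRIBUTE_DIRECTORY) != 0;
}
#else
// Pass-through on other platforms: the UTF-8 bytes are handed to the OS
// unchanged.
bool exists(const std::string& utf8Path) {
    struct stat st;
    return ::stat(utf8Path.c_str(), &st) == 0;
}

bool isDirectory(const std::string& utf8Path) {
    struct stat st;
    return ::stat(utf8Path.c_str(), &st) == 0 && S_ISDIR(st.st_mode);
}
#endif

}  // namespace utf8fs
```

Callers would use utf8fs::exists(path) and utf8fs::isDirectory(path) with UTF-8 strings on every platform, which keeps the #ifdef _WIN32 branches in one place instead of at each call site.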

Comment by Tad Marshall [ 11/Mar/13 ]

This affects pretty much every bit of filename and path usage and display in Windows. SERVER-5099 will prevent us from corrupting non-ASCII text on the command line and will convert this text into UTF-8, but every place that we are using the "A" (ANSI) version of a Windows API or passing UTF-8 to boost::filesystem or the C runtime needs to be fixed. I changed the title and raised the time estimate to account for this.

Comment by owen kao [ 25/Jun/12 ]

In which version will this issue be fixed? It is an important issue for our project. Thanks.

Comment by Tad Marshall [ 03/Jun/12 ]

SERVER-5099 is required to get Unicode text correctly from the command line in Windows. The remaining work in this ticket is to make sure it gets used correctly when working with the file system ... --dbpath and database names in particular.

Comment by Tad Marshall [ 16/Apr/12 ]

Part of the code for this is already written and just needs to be used by mongod.exe: getting the correct dbpath in the first place. The rest of the work is in translating the filespec to 16-bit Windows Unicode and calling the correct "wide" version of the file API.

Comment by Tad Marshall [ 19/Mar/12 ]

Here's what I think happened to Anta when he used Chinese characters in his dbpath:

1) He entered text in code page 950 which displayed fine and which was provided to mongod.exe's "ANSI" main() routine encoded in Microsoft's version of Big5, a double byte character set;

2) mongod.exe tried to use this text as if it were an ASCII or UTF-8 string to access the journal directory and bombed. The error message displayed the correct path to Anta because it was a legal string in his code page 950 character set.

In my test, the same thing happened, except that the character I picked was from ISO Latin-1, where the 8-bit encoding and the Unicode code point match, so the dbpath worked fine. My problem came when the shell interpreted the ISO Latin-1 byte as UTF-8, which produced the error message, and when the character was displayed in the code page I was using for the shell (437).
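
The shell's complaint can be reproduced in miniature. The single byte 0xE1 that represents U+00E1 in ISO Latin-1 is, in UTF-8, the lead byte of a three-byte sequence, so a string ending in that byte is truncated UTF-8. The sketch below is a simplified validator written for this illustration; isValidUtf8 is a hypothetical name and is not the shell's actual decoder.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8. Rejects truncated sequences,
// stray continuation bytes, overlong forms, surrogates, and values above
// U+10FFFF.
bool isValidUtf8(const std::string& s) {
    // Smallest code point that legitimately needs 0, 1, 2, or 3
    // continuation bytes; anything smaller is an overlong encoding.
    static const unsigned long minForExtra[] = {0x0, 0x80, 0x800, 0x10000};

    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        unsigned long cp = 0;   // decoded code point

        if (lead < 0x80) {
            extra = 0; cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F;  // 0xE1 lands here
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }

        // Not enough bytes left for the sequence the lead byte promises.
        if (s.size() - i - 1 < extra) return false;

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(s[i + k]);
            if ((cont & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cont & 0x3F);
        }

        if (cp < minForExtra[extra]) return false;            // overlong
        if (cp > 0x10FFFF) return false;                      // out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;       // surrogate

        i += extra + 1;
    }
    return true;
}
```

With this check, isValidUtf8("c:\\data\\\xE1") is false (the Latin-1 byte), while isValidUtf8("c:\\data\\\xC3\xA1") is true (U+00E1 correctly encoded as the two bytes C3 A1).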

All of this is fixable. All of the Windows mongodb programs should use wmain() instead of main() so they get proper Unicode (UTF-16) text and they should convert every argument to UTF-8 before passing it to boost::program_options or storing it for later use. File open operations should convert the stored UTF-8 into UTF-16 and use the wide form of the Win32 API's CreateFile (etc.) functions. That's about it. If the text is stored internally in correctly translated UTF-8, the rest should just work.
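
The two conversions described above (UTF-16 arguments to stored UTF-8, then stored UTF-8 back to UTF-16 at the file API boundary) can be sketched portably as follows. This is an illustration only: the function names utf16ToUtf8 and utf8ToUtf16 are hypothetical, std::u16string stands in for Windows wchar_t strings, and on Windows itself the same conversions are usually done with WideCharToMultiByte and MultiByteToWideChar using CP_UTF8. The decoder checks structure and range but does not reject overlong forms.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// UTF-16 (as delivered to wmain) -> UTF-8 (as stored internally).
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate: needs a pair
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// UTF-8 (as stored internally) -> UTF-16 (for the wide file APIs).
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t extra = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i - 1 < extra)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(in[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        i += extra + 1;

        if (cp >= 0x10000) {  // encode as a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp);
        }
    }
    return out;
}
```

For the directory name in Anta's report, utf16ToUtf8(u"\u6E2C\u8A66") yields the six bytes E6 B8 AC E8 A9 A6, and utf8ToUtf16 turns those bytes back into U+6E2C U+8A66. Passing a raw code page byte such as 0xE1 to utf8ToUtf16 throws instead of silently producing a wrong path.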

Comment by Tad Marshall [ 19/Mar/12 ]

Additional information provided by Anta Huang, the original poster in the mongodb-user group thread mentioned in the description, and also by Glenn Maynard in the same thread.

1) The character I typed (U+00E1) was being displayed as 'ß' because that is the character at position E1 in code page 437;

2) The NON-ASCII text in Anta's example was '測試', which consists of the characters U+6E2C U+8A66.
