[SERVER-16725] Incorrect character conversion between UTF-8 and UTF-16
Created: 05/Jan/15
Updated: 08/Jan/24
Resolved: 16/Sep/16

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: 3.3.14

Type: Bug
Priority: Major - P3
Reporter: Spencer Jackson
Assignee: Mark Benvenuto
Resolution: Done
Votes: 0
Labels: 28qa, locale, utf-8
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on WT-2863 Support UTF-8 paths on Windows Closed
Duplicate
duplicates SERVER-20697 Running 'mongod.exe' with '--dbpath' ... Closed
is duplicated by SERVER-5333 Issues with non-ASCII characters in f... Closed
Related
Backwards Compatibility: Fully Compatible
Operating System: Windows
Steps To Reproduce:

On Windows, start a mongod with directoryperdb. Create a database with a single multibyte UTF-8 character as its name. Insert a document.

Sprint: Platforms 2016-08-26, Platforms 2016-09-19
Participants:

 Description   

Using multibyte UTF-8 (non-ASCII) characters in a database name causes directory creation under directoryperdb to fail.

Because the BSON spec defines strings to be stored in UTF-8, strings in the server are also UTF-8. Windows, however, uses UTF-16 for its implementation of Unicode and as input to its APIs. This means that we must convert between our internal 8-bit characters and Windows' 16-bit characters before making API calls. For file operations, we do this in two ways.

The first is mongo::File. When open is called on a path, MultiByteToWideChar is called on the path, converting the UTF-8 encoded string to UTF-16.

The second is boost::filesystem::path. This class uses C++'s locale system. A std::locale is an object that describes the properties of a localization; these properties are called facets. One such facet is codecvt, which handles conversion between different string encodings. boost::filesystem::path makes a copy of the global std::locale and overrides its codecvt with a custom converter object. This locale is then saved globally for use in path operations; the original std::locale is left as is. When a path is created or appended to, the codecvt is used, if necessary, to convert the provided string into the operating system's native character format. Unfortunately, boost::filesystem's implementation of the codecvt, windows_file_codecvt, is incomplete: it treats 8-bit strings as being encoded in either the ANSI code page or the OS's OEM code page, never UTF-8. This means that converting a UTF-8 string yields incorrect UTF-16.
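
To make the mismatch concrete, here is a minimal portable sketch of the two interpretations. It is not the server's or Boost's code: decodeUtf8ToUtf16 and widenAsSingleByte are hypothetical helpers that stand in for a UTF-8 aware conversion (what MultiByteToWideChar does when asked for UTF-8) and for a single-byte code page conversion. The single-byte helper uses a Latin-1 mapping as an approximation of an ANSI code page, which is an assumption; real code pages such as Windows-1252 differ for bytes 0x80-0x9F.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
// Sketch only: rejects malformed lead/continuation bytes, but does not
// reject overlong encodings or encoded surrogates.
std::u16string decodeUtf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80) { cp = lead; len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        i += len;
        if (cp >= 0x10000) {
            // Outside the BMP: emit a surrogate pair.
            cp -= 0x10000;
            assert(cp < 0x100000);
            out.push_back(static_cast<char16_t>(0xD800 | (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 | (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}

// Hypothetical helper: widen every byte to one UTF-16 unit, the way a
// single-byte code page conversion treats its input (Latin-1 mapping assumed).
std::u16string widenAsSingleByte(const std::string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (char c : in)
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(c)));
    return out;
}
```

For a database named with the two UTF-8 bytes 0xC3 0xA9 (U+00E9), the first helper yields one UTF-16 unit and the second yields two unrelated units (U+00C3, U+00A9), so the two code paths end up naming different directories.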

Because two mechanisms are used, it appears that we build an incorrect directory name with boost::filesystem::path and create that incorrect directory, then attempt to create a file using the correctly converted path. The directory named in the file path does not exist, so file creation fails.

FileAllocator's makeTempFileName and run functions will need to be modified. makeTempFileName produces a path as a string. Though it uses boost::filesystem::path internally, it translates the path back into 8-bit characters when it converts to std::string. run then calls c_str on that std::string without any width conversion.

One plausible solution is to use Boost.Locale to generate a new std::locale object with a correct codecvt, as described in the Boost.Locale documentation here: http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/default_encoding_under_windows.html
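
The idea of a locale carrying a correct codecvt can be sketched with the standard library alone. This is an illustration, not the fix that was committed: it swaps in std::codecvt_utf8_utf16 (deprecated since C++17) instead of Boost.Locale, and makeUtf8Locale and widenWithLocale are hypothetical names. With Boost, a locale built this way would presumably be handed to boost::filesystem::path::imbue; that step is omitted here to keep the sketch free of dependencies.

```cpp
#include <cassert>
#include <codecvt>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: copy `base` and replace its char <-> wchar_t codecvt
// facet with a UTF-8 <-> UTF-16 converter. The locale owns the new facet.
std::locale makeUtf8Locale(const std::locale& base = std::locale()) {
    return std::locale(base, new std::codecvt_utf8_utf16<wchar_t>);
}

// Hypothetical helper: widen `in` using whichever codecvt facet `loc` carries.
std::wstring widenWithLocale(const std::string& in, const std::locale& loc) {
    if (in.empty())
        return std::wstring();
    using Facet = std::codecvt<wchar_t, char, std::mbstate_t>;
    const Facet& cvt = std::use_facet<Facet>(loc);

    // UTF-16 never needs more code units than the UTF-8 input has bytes.
    std::wstring out(in.size(), L'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* fromNext = in.data();
    wchar_t* toNext = &out[0];
    const std::codecvt_base::result r =
        cvt.in(state, in.data(), in.data() + in.size(), fromNext,
               &out[0], &out[0] + out.size(), toNext);
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("conversion to wide string failed");

    assert(toNext >= &out[0] && toNext <= &out[0] + out.size());
    out.resize(static_cast<std::size_t>(toNext - &out[0]));
    return out;
}
```

Calling widenWithLocale with the bytes 0xC3 0xA9 and the locale from makeUtf8Locale produces the single wide character U+00E9, which is the result the path conversion needs in order to agree with the MultiByteToWideChar path.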



 Comments   
Comment by Githook User [ 16/Sep/16 ]

Author: Mark Benvenuto (markbenvenuto) <mark.benvenuto@mongodb.com>

Message: SERVER-16725 Incorrect character conversion between UTF-8 and UTF-16
Branch: master
https://github.com/mongodb/mongo/commit/f0d958c747cfc42dd831eb2f088e963475c0ed54

Generated at Thu Feb 08 03:42:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.