  1. Core Server
  2. SERVER-16725

Incorrect character conversion between UTF-8 and UTF-16

    • Fully Compatible
    • Windows
    • On Windows, start a mongod with directoryperdb. Create a database with a single multibyte UTF-8 character as its name. Insert a document.

    • Platforms 2016-08-26, Platforms 2016-09-19

      On Windows, using multibyte (non-ASCII) UTF-8 characters in a database name causes directory creation to fail when directoryperdb is enabled.

      Because the BSON spec defines strings to be stored in UTF-8, strings in the
      server are also UTF-8. Windows, however, uses UTF-16 for its implementation of
      Unicode and as the input encoding for its wide-character APIs. This means that we must convert between our internal 8-bit characters
      and Windows' 16-bit characters before making API calls. For file operations, we do this in two ways. The first is mongo::File:
      when open is called on a path, MultiByteToWideChar is called on the path, converting the UTF-8 encoded string to UTF-16.
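
As a rough illustration of the first mechanism, the sketch below converts a UTF-8 path to the native string type before it would be handed to an OS call. The helper name toNativePath and the NativeString alias are hypothetical and do not come from the server source; the Windows branch assumes MultiByteToWideChar is called with CP_UTF8, as the description above implies. On other platforms the path is passed through unchanged.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

#ifdef _WIN32
#include <windows.h>
using NativeString = std::wstring;  // UTF-16 on Windows
#else
using NativeString = std::string;   // narrow paths elsewhere
#endif

// Hypothetical helper: convert a UTF-8 path to the platform's native form.
NativeString toNativePath(const std::string& utf8) {
#ifdef _WIN32
    if (utf8.empty()) return std::wstring();
    const int inLen = static_cast<int>(utf8.size());
    // First call: ask how many UTF-16 code units the result needs.
    const int outLen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                           utf8.data(), inLen, nullptr, 0);
    if (outLen <= 0) throw std::runtime_error("path is not valid UTF-8");
    std::wstring wide(static_cast<std::size_t>(outLen), L'\0');
    // Second call: perform the conversion into the sized buffer.
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), inLen, &wide[0], outLen);
    return wide;
#else
    return utf8;  // no width conversion needed in this sketch
#endif
}
```

The important detail is the explicit CP_UTF8 code page: the caller states what encoding the 8-bit input is in instead of leaving that choice to a default.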
      The second is through boost::filesystem::path. This class uses C++'s locale system. A std::locale is an object that bundles
      the properties a localization might have. These properties are called facets. One such facet is codecvt, which handles
      conversion between different character encodings. boost::filesystem::path makes a copy of the global std::locale and replaces its
      codecvt with a custom converter object. This locale is then saved globally for use in path operations. When a path is created or
      appended to, the codecvt is used, if necessary, to convert the provided string into the operating system's native character format.
      The original std::locale is left as is. Unfortunately, boost::filesystem's implementation of the codecvt, windows_file_codecvt, is incomplete.
      It treats the 8-bit characters as being in either the ANSI code page or the OS's OEM code page, never as UTF-8. This means the conversion of UTF-8 input will be invalid.
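
The mismatch can be shown without any Windows API. The portable sketch below converts the same bytes to UTF-16 in two ways: once decoding them as UTF-8, and once widening each byte as if it belonged to a single-byte code page. Latin-1 is used here as a stand-in for an ANSI code page, which is an assumption for illustration; the function names are hypothetical, and the UTF-8 decoder is deliberately minimal and assumes well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decode UTF-8 into UTF-16. Minimal sketch: assumes well-formed input and
// does not reject overlong or truncated sequences.
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        // Fold in the low six bits of each continuation byte.
        for (std::size_t k = 0; k < extra && i < in.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        if (cp >= 0x10000) {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}

// Treat every byte as one character of a single-byte code page (Latin-1
// here), which is the kind of conversion that mangles UTF-8 input.
std::u16string singleByteToUtf16(const std::string& in) {
    std::u16string out;
    for (unsigned char c : in) out.push_back(static_cast<char16_t>(c));
    return out;
}
```

For the two-byte UTF-8 encoding of U+00E9 (bytes 0xC3 0xA9), utf8ToUtf16 yields the single code unit U+00E9, while singleByteToUtf16 yields two code units, U+00C3 and U+00A9. A directory created from the second result has a different name from a file path built from the first.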

      Because the two mechanisms disagree, it appears that we build an incorrect directory name with boost::filesystem::path, create that incorrectly named directory, and then attempt to create a file under the correctly converted path. The directory named in the file path does not exist, so file creation fails.

      FileAllocator's makeTempFileName and run functions will need to be modified. makeTempFileName produces a path as a string. Though it uses boost::filesystem::path internally, it translates the path back into 8-bit characters when it converts to std::string. run then calls c_str on that std::string without any width conversion.

      A plausible solution might be to use Boost's locale library to generate a new std::locale object with a correct codecvt, as described in the
      Boost.Locale documentation here: http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/default_encoding_under_windows.html

            Assignee:
            mark.benvenuto@mongodb.com Mark Benvenuto
            Reporter:
            spencer.jackson@mongodb.com Spencer Jackson