[SERVER-28087] MongoDB should behave identically when installed on any locale OS. Created: 19/Feb/17  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Akira Kurogane Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: all_non_cjk_ideograph_letter.txt (text file)
Assigned Teams:
Query Execution
Participants:

 Description   

I found that, for all practical purposes, MongoDB can be installed on a Unix server of any locale (English, Spanish, Chinese, whatever) without changing behaviour, except for the following two points:

  1. There are three case-related aggregation operators that use the (g)libc `tolower`, `toupper` and `strcasecmp` functions.
  2. strerror() output from (g)libc that is reprinted in error responses or in the log will be in the language of the OS locale.

#2 is no big deal; it doesn't affect the correctness of the DB in any way.

But #1, by itself, ruins our ability to say "MongoDB can be installed on a server with any locale". I was originally pleasantly surprised to figure out how little effect locale could have on a mongod process, but I found that the user was just hearing "broken", "broken", "broken" when I explained there was this one exception.

I suggest that $strcasecmp, $toUpper and $toLower be changed to use the ICU equivalents instead of the libc functions, using simple case mapping / folding per http://userguide.icu-project.org/transforms/casemappings.
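
To illustrate the behaviour I am asking for, here is a minimal sketch in Python rather than in the server's C++. The function names are hypothetical, and Python's `str.casefold()` / `str.upper()` implement full Unicode case folding and mapping, not the simple folding suggested above, so treat this only as a stand-in for the ICU calls. The point is that the result depends on the Unicode tables alone and never on the process locale.

```python
def strcasecmp_unicode(a, b):
    """Case-insensitive comparison returning -1, 0 or 1.

    Hypothetical stand-in for a locale-independent $strcasecmp.
    str.casefold() uses Unicode case folding and ignores the OS locale.
    """
    fa = a.casefold()
    fb = b.casefold()
    # (x > y) - (x < y) yields 1, 0 or -1 for the code point ordering.
    return (fa > fb) - (fa < fb)


def to_upper_unicode(s):
    """Hypothetical stand-in for a locale-independent $toUpper."""
    return s.upper()


def to_lower_unicode(s):
    """Hypothetical stand-in for a locale-independent $toLower."""
    return s.lower()
```

With the libc functions, whether "É" and "é" compare equal depends on the locale the server was started in; with a Unicode-table approach like the one sketched here they compare equal everywhere.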



 Comments   
Comment by Asya Kamsky [ 23/Feb/17 ]

Other components may be involved in addition to aggregation.

Comment by Akira Kurogane [ 20/Feb/17 ]

Attaching a UTF-8 file, "all_non_cjk_ideograph_letter.txt", containing every 'letter' in the Basic Multilingual Plane minus the CJK ideographs. It can serve as a test string for case folding.

Generated using the following Python:

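import sys
import unicodedata

# Write every code point whose general category is a letter (L*) to stdout.
# Redirect stdout to a file; stdout is assumed to use UTF-8.
for i in range(0xFFFF):  # just the basic 64k Unicode code points
    if i < 0x3400 or i >= 0xF900:  # exclude the CJK ideographs and some similar things
        u = chr(i)
        if unicodedata.category(u).startswith("L"):
            sys.stdout.write(u)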
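
A test built on that string could take roughly the following shape. This is only a sketch under assumptions: the helper names are hypothetical, Python's own string methods stand in for the aggregation operators, and a real test would run the aggregation against a mongod started under each locale. Locales that are not installed on the host are skipped.

```python
import locale


def case_results(text):
    """Return the upper-cased, lower-cased and case-folded forms of text."""
    return text.upper(), text.lower(), text.casefold()


def differing_locales(text, locale_names):
    """Return the locales under which the case results differ from the "C" locale."""
    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        locale.setlocale(locale.LC_ALL, "C")
        baseline = case_results(text)
        differing = []
        for name in locale_names:
            try:
                locale.setlocale(locale.LC_ALL, name)
            except locale.Error:
                continue  # locale not installed on this host; skip it
            if case_results(text) != baseline:
                differing.append(name)
        return differing
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # restore the original locale
```

An empty list means the case operations behaved identically under every locale that could be selected, which is the property this ticket asks for.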
