[SERVER-31152] Wrong diacritics for polish letter ł Created: 19/Sep/17  Updated: 27/Oct/23  Resolved: 07/Nov/17

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.4.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Mateo Assignee: Kyle Suarez
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to DOCS-10976 Clarify diacritic insensitivity for t... Closed
Operating System: ALL
Steps To Reproduce:

db.t1.insert([{title: "Łódź"}])
db.getCollection("t1").createIndex({ "title": "text" }, { "name": "tq", collation:{locale:'simple'} })
db.t1.find( {$text: { $search: 'lodz' }} ) // no results
db.t1.find( {$text: { $search: 'łodz' }} ) // 1 hit

Sprint: Query 2017-10-23, Query 2017-11-13
Participants:

 Description   

According to the docs at https://docs.mongodb.com/manual/core/index-text/, the text index
should fold characters with diacritics properly. One exception I found is the Polish letter Ł/ł, which is not folded to the letter L.



 Comments   
Comment by Wojciech Jakubas [ 08/Nov/21 ]

Four years later, text search still does not work with the Polish letter ł/Ł.

The workarounds mentioned above are not consistent with the other Polish letters such as ź, ż, ó, ć, ą, etc. Those work fine; only ł/Ł does not (see the sketch below).
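
For concreteness, here is a minimal shell sketch of that inconsistency (the collection name and sample words are hypothetical; default_language: "none" is used only to keep stemming out of the picture):

db.pl.drop()
db.pl.insert([{t: "żaba"}, {t: "łodyga"}])
db.pl.createIndex({t: "text"}, {default_language: "none"})
db.pl.find({$text: {$search: "zaba"}})    // expected: 1 hit (ż folds to z)
db.pl.find({$text: {$search: "lodyga"}})  // expected: 0 hits (ł is not folded to l)
db.pl.find({$text: {$search: "łodyga"}})  // expected: 1 hit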

Comment by Mateo [ 08/Nov/17 ]

I was using 3.4.3 from the openSUSE 42.3 repository, but I have now upgraded to 3.4.10 from https://download.opensuse.org/repositories/server:/database/openSUSE_Leap_42.3/ and I still get the same result: 1 hit. Do I need to enable or disable something?

I did a fresh install on a Windows machine:

db.cities.find({$text: { $search: 'lodz' }}).collation({locale: "en", strength: 1, caseLevel: false, caseFirst: "off", backwards: false}) // 1 hit
db.cities.find({$text: { $search: 'łodz' }}).collation({locale: "en", strength: 1, caseLevel: false, caseFirst: "off", backwards: false}) // 2 hits
db.cities.find({name: "lodz"}).collation({locale: "en", strength: 1, caseLevel: false, caseFirst: "off", backwards: false}) // 1hit
db.cities.find({name: "łodz"}).collation({locale: "en", strength: 1, caseLevel: false, caseFirst: "off", backwards: false}) // 0 records

------------------
I did some further tests on Debian 9 with 3.4.10 and it worked. I don't know why. What is the difference between all these systems?

Comment by Kyle Suarez [ 07/Nov/17 ]

What version of the server are you using? I tested both the current tip of master (b937ec566) and MongoDB 3.4.9, and both have identical results: all three documents are found.

> db.cities.insert([{name: "Łódź"}, {name: "łódź"}, {name: "lodz"}])
BulkWriteResult({
        "writeErrors" : [ ],
        "writeConcernErrors" : [ ],
        "nInserted" : 3,
        "nUpserted" : 0,
        "nMatched" : 0,
        "nModified" : 0,
        "nRemoved" : 0,
        "upserted" : [ ]
})
> db.cities.find({name: "lodz"}).collation({locale: "en", strength: 1, caseLevel: false, caseFirst: "off", backwards: false})
{ "_id" : ObjectId("5a02224e016fc6aafb1c3d1a"), "name" : "Łódź" }
{ "_id" : ObjectId("5a02224e016fc6aafb1c3d1b"), "name" : "łódź" }
{ "_id" : ObjectId("5a02224e016fc6aafb1c3d1c"), "name" : "lodz" }

I also tested this in the presence of a text index on the "name" field, and the query correctly ignores the index:

> db.cities.createIndex({name: "text"})
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1
}
> db.cities.find({name: "lodz"}).collation({locale: "en", strength: 1, caseLevel: false, caseFirst: "off", backwards: false})
{ "_id" : ObjectId("5a02224e016fc6aafb1c3d1a"), "name" : "Łódź" }
{ "_id" : ObjectId("5a02224e016fc6aafb1c3d1b"), "name" : "łódź" }
{ "_id" : ObjectId("5a02224e016fc6aafb1c3d1c"), "name" : "lodz" }
> db.cities.find({name: "lodz"}).collation({locale: "en", strength: 1, caseLevel: false, caseFirst: "off", backwards: false}).explain()
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "test.cities",
                "indexFilterSet" : false,
                "parsedQuery" : {
                        "name" : {
                                "$eq" : "lodz"
                        }
                },
                "collation" : {
                        "locale" : "en",
                        "caseLevel" : false,
                        "caseFirst" : "off",
                        "strength" : 1,
                        "numericOrdering" : false,
                        "alternate" : "non-ignorable",
                        "maxVariable" : "punct",
                        "normalization" : false,
                        "backwards" : false,
                        "version" : "57.1"
                },
                "winningPlan" : {
                        "stage" : "COLLSCAN",
                        "filter" : {
                                "name" : {
                                        "$eq" : "lodz"
                                }
                        },
                        "direction" : "forward"
                },
                "rejectedPlans" : [ ]
        },
        "serverInfo" : {
                "host" : "celadon",
                "port" : 27017,
                "version" : "0.0.0",
                "gitVersion" : "unknown"
        },
        "ok" : 1
}

Comment by Mateo [ 07/Nov/17 ]

To summarize: how do I query for all 3 cities from your third answer?

db.cities.insert([{name: "Łódź"}, {name: "łódź"}, {name: "lodz"}])
db.cities.find({name: "lodz"}).collation({locale: "en", strength: 1, caseLevel: false, caseFirst: "off", backwards: false})

This gives me one hit. Querying the text index with $language: 'en' also gives one hit.

Comment by Kyle Suarez [ 07/Nov/17 ]

Hi vikingpl,

Looks like I chose my collations poorly in my first attempt. The Polish collation will treat L and Ł as distinct, as will the POSIX locale. In MongoDB, you can achieve the desired results by using a collation locale where "Ł" is not a standard character**, like English:

> assert.writeOK(db.test.insert({val: "Ł"}));
WriteResult({ "nInserted" : 1 })
> assert.writeOK(db.test.insert({val: "ł"}));
WriteResult({ "nInserted" : 1 })
> assert.writeOK(db.test.insert({val: "L"}));
WriteResult({ "nInserted" : 1 })
> assert.writeOK(db.test.insert({val: "l"}));
WriteResult({ "nInserted" : 1 })
> db.test.find({val: "l"}).collation({locale: "en", strength: 1, caseLevel: false, caseFirst: "off", backwards: false})
{ "_id" : ObjectId("5a01e5dd9041761b857f12e1"), "val" : "Ł" }
{ "_id" : ObjectId("5a01e5dd9041761b857f12e2"), "val" : "ł" }
{ "_id" : ObjectId("5a01e5dd9041761b857f12e3"), "val" : "L" }
{ "_id" : ObjectId("5a01e5dd9041761b857f12e4"), "val" : "l" }

However, I will say that the Unicode collation algorithm is completely distinct from our diacritic-stripping algorithm. Going back to your original request, after examining the behavior of text search, I would say that the current diacritic-stripping algorithm is correct with regard to the letter Ł. If the Unicode Consortium is not going to specify a decomposition for Ł, I don't think that MongoDB should attempt to make one up, even if it "seems like" it should.

Given all this, I'm going to close this ticket as Works as Designed. Collation provides an alternative to text search for this particular peculiarity.
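
For reference, here is a minimal shell sketch contrasting the two behaviors (the collection name is hypothetical):

db.demo.drop()
db.demo.insert({name: "Łódź"})
db.demo.createIndex({name: "text"})
// Text search strips ó and ź but leaves Ł, so the indexed term is "łodz":
db.demo.find({$text: {$search: "lodz"}})    // expected: 0 documents
db.demo.find({$text: {$search: "łodz"}})    // expected: 1 document
// A strength-1 "en" collation compares Ł and L as primary-equal:
db.demo.find({name: "lodz"}).collation({locale: "en", strength: 1})    // expected: 1 document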

Best,
Kyle

**I found which locales would work by running this script and seeing which collations found all four results: https://gist.github.com/ksuarz/2e801814459ee2a013738cdf8c5ae9ef

Comment by Mateo [ 03/Nov/17 ]

Try

SELECT * FROM cities WHERE name COLLATE utf8_unicode_520_ci LIKE 'lodz';
 
+----+------+
| id | name |
+----+------+
|  1 | Łódź |
|  2 | łódź |
|  3 | lodz |
+----+------+
3 rows in set (0.00 sec)

By default, utf8_unicode_ci in MySQL/MariaDB is based on Unicode 4.x, so it inherits all Unicode 4.x behavior.
In later Unicode versions, the default weight table changed; in Unicode 5.x, 'l' and 'ł' are already primary-equal letters:
ftp://ftp.unicode.org/Public/UCA/5.1.0/allkeys.txt

Comment by Kyle Suarez [ 03/Nov/17 ]

Hi vikingpl,

Which collation are you using? I haven't been able to perform a search that treats "L" and "Ł" the same in either MongoDB or MariaDB.

We can't use the Polish locale, since that locale will definitely treat the two letters as distinct no matter what the strength. In MongoDB, I tried en_US_POSIX:

> db.cities.insert([{name: "Łódź"}, {name: "łódź"}, {name: "lodz"}])
BulkWriteResult({
        "writeErrors" : [ ],
        "writeConcernErrors" : [ ],
        "nInserted" : 3,
        "nUpserted" : 0,
        "nMatched" : 0,
        "nModified" : 0,
        "nRemoved" : 0,
        "upserted" : [ ]
})
 
> db.cities.find({name: "lodz"}).collation({locale: "en_US_POSIX", strength: 1})
{ "_id" : ObjectId("59fc7280cac96cc79a0a9cec"), "name" : "lodz" }

In MariaDB, I tried both utf8_general_ci and utf8_unicode_ci:

MariaDB [poland]> SELECT * FROM cities;
+---------+
| name    |
+---------+
| Łódź    |
| łódź    |
| lodz    |
+---------+
3 rows in set (0.00 sec)
 
MariaDB [poland]> SELECT * FROM cities WHERE name = "lodz" COLLATE utf8_unicode_ci;
+------+
| name |
+------+
| lodz |
+------+
1 row in set (0.00 sec)

MariaDB [poland]> SELECT * FROM cities WHERE name LIKE "lodz" COLLATE utf8_unicode_ci;
+------+
| name |
+------+
| lodz |
+------+
1 row in set (0.00 sec)

MariaDB [poland]> SELECT * FROM cities WHERE name = "lodz" COLLATE utf8_general_ci;
+------+
| name |
+------+
| lodz |
+------+
1 row in set (0.00 sec)

MariaDB [poland]> SELECT * FROM cities WHERE name LIKE "lodz" COLLATE utf8_general_ci;
+------+
| name |
+------+
| lodz |
+------+
1 row in set (0.00 sec)

It's worth noting that MongoDB only uses the official Unicode collation algorithm. There are other custom collation algorithms supported by Oracle MySQL, but those are totally non-standard.

Comment by Mateo [ 03/Nov/17 ]

Thanks for the clarification.
But how does it work in other DB systems like MySQL? They use ftp://ftp.unicode.org/Public/UCA/5.1.0/allkeys.txt, which contains, for example:

006C  ; [.1262.0020.0002.006C] # LATIN SMALL LETTER L
0142  ; [.1262.0020.0002.006C][.0000.008D.0002.0335] # LATIN SMALL LETTER L WITH STROKE; QQCM

and with the proper collation set there is no problem searching for Ł, because its weight is the same as for the letter L.

Comment by Kyle Suarez [ 02/Nov/17 ]

Given my findings above, I'm keeping this ticket in the "Needs Triage" queue so that it can be reconsidered by the Query Team as a whole during the next planning meeting.

Comment by Kyle Suarez [ 02/Nov/17 ]

Hi vikingpl,

Sorry for the delay in getting to this issue. To strip diacritics, MongoDB applies some transformations specified in the Unicode standard. To be specific, we decompose characters using NFD, remove characters that are classified as diacritics, and then re-compose the result with NFC. For example, this transformation breaks down Ç into C + ◌̧, removes the combining cedilla character and results in C.
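
For illustration only, here is a rough JavaScript sketch of that NFD / strip / NFC sequence, runnable in any modern JavaScript shell. It uses the basic combining-marks block (U+0300 through U+036F) as a stand-in for MongoDB's actual diacritic classification, which is broader:

// Illustrative sketch only; not MongoDB's actual implementation.
function stripDiacritics(s) {
    return s.normalize("NFD")                 // decompose: "Ç" becomes "C" + U+0327
            .replace(/[\u0300-\u036f]/g, "")  // drop combining diacritical marks
            .normalize("NFC");                // re-compose what remains
}
stripDiacritics("Ç")    // "C"
stripDiacritics("ódź")  // "odz"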

However, there are certain characters, like Ł (LATIN CAPITAL LETTER L WITH STROKE), for which there is no simpler Unicode decomposition even if you might expect it. MongoDB uses Unicode version 8.0, but this excerpt from Unicode 10.0 is still informative:

Non-decomposition of Certain Diacritics

Most characters that one thinks of as being a letter “plus accent” have formal decompositions in the Unicode Standard... There are, however, exceptions involving certain types of diacritics and other marks.

Overlaid and Attached Diacritics. Based on the pattern for accented letters, implementers often also expect to encounter formal decompositions for characters which use various overlaid diacritics such as slashes and bars to form new Latin (or Cyrillic) letters. For example, one might expect a decomposition for U+00D8 latin capital letter o with stroke involving U+0338 combining long solidus overlay. However, such decompositions involving overlaid diacritics are not formally defined in the Unicode Standard. For historical and implementation reasons, there are no decompositions for characters with overlaid diacritics such as slashes and bars, nor for most diacritic hooks, swashes, tails, and other similar modifications to the graphic form of a base character. In such cases, the generic identification of the overlaid element is not specific enough to identify which part of the base glyph is to be overlaid.
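
Running the same sketch against characters with overlaid diacritics makes the point concrete: NFD leaves them as single code points, so the strip step finds nothing to remove:

"Ç".normalize("NFD")     // "C" + U+0327 COMBINING CEDILLA: decomposes, so the cedilla can be stripped
"Ł".normalize("NFD")     // "Ł" (U+0141): unchanged, nothing to strip
"Ø".normalize("NFD")     // "Ø" (U+00D8): unchanged
stripDiacritics("Łódź")  // "Łodz": ó and ź fold, Ł stays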

I can understand if you'd like a feature request where we expand our diacritic-stripping algorithm to also handle characters like Ł. However, this will probably involve a text index version bump and we presently don't have plans to take it upon ourselves to classify what is a diacritic and what is not beyond the current Unicode standard.

I've opened DOCS-10976 so that we can clarify in our documentation which diacritics are stripped in the present text index version 3.

Regards,
Kyle

Comment by Mark Agarunov [ 19/Sep/17 ]

Hello vikingpl,

Thank you for the report. I've managed to reproduce this behavior using the detailed steps provided. I've set this ticket to Needs Triage so it can be scheduled against our currently planned work; please watch this ticket for updates on this issue.

Thanks,
Mark
