[SERVER-1920] Sort by collation Created: 11/Oct/10  Updated: 06/Dec/22  Resolved: 23/Aug/16

Status: Closed
Project: Core Server
Component/s: Index Maintenance, Querying
Affects Version/s: 1.6.3
Fix Version/s: 3.3.11

Type: New Feature Priority: Major - P3
Reporter: ppalka Assignee: Backlog - Query Team (Inactive)
Resolution: Done Votes: 122
Labels: query_triage
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by DRIVERS-291 Support providing collation per opera... Closed
is depended on by SERVER-90 case insensitive index Closed
Related
related to CXX-290 Problem with Query & hint (const stri... Closed
related to DOCS-7503 Document the collation project (part 1) Closed
is related to SERVER-9367 toLowerCase() function does not work ... Closed
Assigned Teams:
Query
Backwards Compatibility: Fully Compatible
Participants:
Case:

 Description   
Issue Status as of August 23, 2016

ISSUE SUMMARY

Version 3.3.11 of MongoDB introduces support for unicode-aware string comparisons, allowing users to issue queries that sort and match UTF-8 encoded string data in a locale-aware fashion. The server will accept a collation document specifying the locale, amongst other properties of the string comparator, such as diacritic sensitivity and case sensitivity. The collation can be attached at the operation level to a particular query. Alternatively, a default collation can be specified at collection creation time which will be used by all operations over the collection.

TECHNICAL DETAILS

Syntax for specifying a collation

The collation is specified with a document of the following form:

collation: {
    locale: <string>,
    caseLevel: <bool>,
    caseFirst: <string>,
    strength: <int>,
    numericOrdering: <bool>,
    alternate: <string>,
    maxVariable: <string>,
    normalization: <bool>,
    backwards: <bool>
}

All fields are optional, except for the locale field, which is required. The list of supported locales as well as documentation of all collation options is available here: Development Series 3.3.x Collation.

Supported operations

A collation can be attached at the operation level to the following commands:

  • aggregate
  • count
  • distinct
  • find
  • findAndModify
  • geoNear
  • group
  • mapReduce
  • remove
  • update

If the collation is omitted, then the collection's default collation will be used.

An operation with a collation will use the collation for all string comparisons of stored data. If, for example, an aggregation is issued with a $match stage followed by a $sort stage with the diacritic-insensitive French collation, then the server will apply the diacritic-insensitive French semantics to both the match and the sort.

Index support

A collation can also be associated with an index at index creation time. Indexes with a collation can support string matching and string sorting operations if the collation associated with the index is identical to the index associated with the query. The following index types accept a collation at index build time:

  • btree
  • 2dsphere

Index builds issued against a collection with a default collation will inherit the collection default unless an overriding collation is specified explicitly on the createIndex command.

Example

The following example demonstrates how to use the mongo shell to sort strings using French Canadian comparison rules:

> db.myColl.insert([{_id: 1, "term": "cote"}, {_id: 2, "term": "coté"}, {_id: 3, "term" : "côte"}, {_id: 4, "term" : "côté"}]);
> db.myColl.find().sort({"term": -1}).collation({"locale": "fr_CA"});
{ "_id" : 4, "term" : "côté" }
{ "_id" : 2, "term" : "coté" }
{ "_id" : 3, "term" : "côte" }
{ "_id" : 1, "term" : "cote" }

Note that the order in which the result set is sorted would be different without the .collation() modifier, as the fr_CA locale includes the backwards option by default, enabling special French comparison rules for diacritical marks.

More details

For more thorough technical documentation, please refer to the documentation.

IMPACT ON DOWNGRADE

Downgrade from 3.4 to 3.2 is illegal if the data files contain any collections or indices with a collation. Before downgrading, all collections and indices with an associated collation must be dropped.

FURTHER INFORMATION

Documentation for this feature is available in the 3.3.x development series release notes. To join our beta program for Collation Support in MongoDB, and suggest improvements to our implementation, please email beta@mongodb.com.

Original description

I need to properly mongodb sorting characters that are in the wrong order when sorting in utf-8. MySQL has an option to "collation" by which we can set that properly were also ordered list of results by the Polish characters, eg: by utf8_polish_ci



 Comments   
Comment by Viktor Hedefalk [ 23/Aug/16 ]

Yey! Seeing this is resolved makes me super happy! Great!

Comment by Piotr Wilkin [ 16/Jun/15 ]

Could we please get some status updates on this? For many non-English speaking users, this is a pretty important issue.

Comment by Chris Hirt [ 11/May/15 ]

+1 Just like to say this issue is really important for us! Wish I had known before we chose Mongo! Storing the sort key as others have mentioned is our strategy going forward, although I haven't worked out yet how long that will take to implement. Our server-side application is written in PHP, and so I found this blog post on persisting sort keys helpful. http://derickrethans.nl/mongodb-collation.html

Comment by Paco Hernández [ 07/Jan/15 ]

I have seen that WiredTiger has an interface that allows to provide custom ordering:
http://source.wiredtiger.com/2.3.1/struct_w_t___c_o_l_l_a_t_o_r.html

https://github.com/mongodb/mongo/blob/de54755e568481d1bdef37339d899403e3b04d86/src/mongo/db/storage/wiredtiger/wiredtiger_index.cpp

Would be possible to implement a patch for the new WiredTiger engine in MongoDB 2.8?

Thank you.

Comment by Nikita Dedik [ 26/Nov/14 ]

Very sad that there's no reaction from MongoDB developers on such an important issue for YEARS.

Comment by Ismet Ozalp [ 24/Oct/14 ]

Please at least tell community when this is going to be released. From my perspective this will not be resolved any time soon, and I am not even sure whether it is going to be resolved or not. The problem is people are investing so much on mongodb before even realizing mongodb is not capable of sorting multilingual strings. Currently we can not do server side paging due to sort order of documents. Which makes very hard to develop scalable applications on some specific cases. If I had known this before I would not use mongodb for my current project in the first place. There is a comment a few months ago telling 10gen wants it in 2.6 release but still it is not even scheduled for any version. I am not a pessimist guy but seeing this issue started on 2010 I am a little hopeless and frankly I like mongodb very much and I want to use mongodb wherever possible, but with out this feature sometimes we are having so much pain, which defeats the purpose of using mongodb.
Please shed some light on this matter, we are a little desperate here.

Comment by J. Cardina [ 18/Aug/14 ]

Guys this issue was opened in 2010, it's just not going to happen. We need to find an alternative. Any suggestions welcomed at this point.

Comment by Dieter  Guendisch [ 18/Aug/14 ]

Maybe not utf8_general_ci but more utf8_unicode_ci.
We need that feature as well as we have customers from outside the US

Comment by Muhammad Hussein Fattahizadeh [ 30/Jul/14 ]

+1 ... like mysql we need utf8_general_ci for sort correctly for all languages.
i have sort problem for persian.

Comment by Martin Flower [ 07/Jul/14 ]

77 votes for this issue now

Comment by Miko?aj Michalczyk [ 27/Jun/14 ]

+1

Comment by Petr Novak [ 29/May/14 ]

Related to https://jira.mongodb.org/browse/SERVER-153

Comment by Petr Novak [ 26/May/14 ]

4 years and no progress on this feature required for most internationalized applications, Oracle, MySQL, PostgreSQL, etc. have some support.

Now we are thinking about move from MongoDB to CouchDB - there is some basic support for String sorting - http://docs.couchdb.org/en/latest/couchapp/views/collation.html

Please, provide support for http://unicode.org/reports/tr10/, some integration of http://icu-project.org/ with http://icu-project.org/apiref/icu4c/ - http://icu-project.org/apiref/icu4c/classicu_1_1Collator.html could help to solve this issue.

The basic solution could be based on idea (http://docs.oracle.com/cd/B19306_01/server.102/b14225/ch5lingsort.htm#i1006421) - I want to create some special "Collation" kind of MongoDB index - this index calculates the CollationKey (with some settings) for attribute and than this key will be stored in index and used for Sort and Compare String attributes.

We are now testing (to compare) the solution based on CouchDB - Collation View and a workaround for MongoDB based on synthetic "column/attribute" calculated in application by http://docs.oracle.com/javase/7/docs/api/java/text/CollationKey.html and than stored in JSON document and indexed by mongodb.
The CollationKey attribute is working, but it is not ideal, because it allocates too much space - ck_attribute+ck_index and all queries must be rewritten to reflect this synthetic attribute for Sort and Compare.

Comment by Nicholas Marshall [ 20/May/14 ]

Came here to report this bug, glade to see that it's known. Chinese names are not sorting correctly.

Comment by NOVALUE Mitar [ 08/Apr/14 ]

Hamilton: To which one?

Comment by Hamilton Vera [ 08/Apr/14 ]

Hi Petterson, same problem here.

We are already porting our mongo data to another DB.

Comment by Petterson Andrade [ 08/Apr/14 ]

I don't know the Bug priority but I think this is very critical for Brazilian users. I made tests using Ç e à and sort return wrong answer. Our system use 'sort' command for many collections. Our system is in production and many customer already have seeing this problem.

Comment by Daniel Pasette (Inactive) [ 22/Mar/14 ]

There are many features in the backlog that need to be considered and done in the proper order according to resources and priority. Despite it being high on the list, I can't give a firm date for when this feature will be complete.

Comment by J. Cardina [ 21/Mar/14 ]

Dan? I guess we have our answer how seriously this is being taken.

Comment by J. Cardina [ 09/Mar/14 ]

Good to hear Dan, when will this work be started and when will it be completed?

Comment by Daniel Pasette (Inactive) [ 05/Mar/14 ]

We at MongoDB very much want this feature too; we wanted to have this ready for the 2.6 release, but weaving this functionality into the server is a significant undertaking and we want to ensure we have all the bases covered. User-configurable collation support requires careful consideration of the interface and its performance impact on the database. This project will end up touching many core components. We need to make modifications to the query and aggregation language, to add index specification extensions and a versioning scheme, and finally to implement the ability to dynamically load the ICU third-party library for all the platforms we support, including Windows.

Comment by Søren Boll Overgaard [ 04/Mar/14 ]

Honestly, this is becoming embarrassing.
Around these parts (non-english speaking parts of the world I suppose) MongoDB is becoming known as the database which can't sort.
Because of this, we are now forced to do all sorting and querying on a SOLR instance, before retrieving (and manually re-sorting) objects from MongoDB.

Would you at least consider making this shortcoming clear in the MongoDB documentation until such a time as the problem is fixed?

Comment by Viktor Hedefalk [ 28/Feb/14 ]

+1. I do not recommend anyone to use mongodb anymore. I thought this was going to be fixed fast since its basically a deal-breaker for anything I ever used a DB for.

Comment by J. Cardina [ 27/Feb/14 ]

This is never going to be implemented is it?

Every time I get some email from Mongodb touting some new release or some new conference or other new B.S. I check to see if this has been implemented and...of course....nothing and I realize more and more that the powers that be just don't care about this issue which is truly odd since it's so critical for nearly any modern business application.

Well I'm getting the message loud and clear: you don't care about it and you are not interested in supporting users who need it.

I advise anyone who is contemplating using Mongo and requires this glaring omission to be rectified to consider looking elsewhere. It appears that they are simply not interested in supporting this feature and will not put any effort or resources towards it beyond lip service only.

I advise whoever triages this stuff to simply close this issue and make your intentions perfectly clear so no other developers waste time with this mess.

Comment by Harald Lapp [ 13/Nov/13 ]

I find it really frustrating, that this issue (and the lower case sorting) get's pushed back and back again in version. Considering that this ticket was created three years ago i really wonder, if it will ever be solved. Is correct sorting behaviour really that unimportend for a database product? sorry for the rant ... but ... i find it really frustrating ...

Comment by Minh Nguyen [ 18/Oct/13 ]

I have this problem with utf-8 general ci.

Comment by J. Cardina [ 22/Aug/13 ]

Any update on this? If it was "near the top of the priority list" months ago surely it's at the very top now after all the updates I've seen go by without it.

Comment by Andres Jaimes [ 28/Jun/13 ]

One more vote for this!

Comment by Eric Milkie [ 04/Jun/13 ]

It's near the top of the priority list, but the exact release date has not yet been set.

Comment by J. Cardina [ 03/Jun/13 ]

This issue drastically limits the usefulness of MongoDB for public facing software. I'm not familiar with this tracking system so forgive me if this is spelled out and I'm missing it but: Is there any time frame or even interest for implementing this? All I see is the last comment was last year and it has a major priority but I don't know what that means in practical terms and I have to decide if MongoDB is a dead end now for us.

Is there any possibility this will be implemented in the next 6 months?

Comment by Tuner [ 21/Dec/12 ]

@Florian, @Daniel +1

Comment by Daniel Walter [ 21/Dec/12 ]

Indeed, utf8_general_ci would be a big step forward and help most of the applications out there.

However, there are several regions with special cases like e.g. utf8_swedish_ci or utf8_spanish_ci that would be let down by using a hard-coded collation. So I'd second Florian's suggestion to use some sort of configuration, may it be LC_COLLATE or even on DB level with the ability to set it for each database separately.

Comment by Florian Sesser [ 20/Dec/12 ]

@Eliot, utf8_general_ci is better than memcmp(), but I wouldn't count it as a big step forward. Some configurability needs to be there IMHO.

What could be "good enough" for many is "use the system locale", that is, use the collation of the surrounding environment locale (LC_COLLATE) for all of this mongodb instance's databases.

Comment by Tuner [ 20/Dec/12 ]

Hard to say if it's "good enough". Here is a "bug" in Polish language when using utf8_general_ci: http://stackoverflow.com/a/3506373

Comment by Eliot Horowitz (Inactive) [ 20/Dec/12 ]

For those who need this, is the functionality of utf8_general_ci in mysql good enough?

Comment by Florian Sesser [ 17/Oct/12 ]

Full ACK, Tuner!

Comment by Tuner [ 17/Oct/12 ]

This issue IMO is the most important lack of feature for the moment. For users lack of sorting by collation (case-insensitive for example) must be really big surprise.

If anybody would be interested in a fix. Just rename the folder with "MongoDB" to "Mongo-dont-expect-too-much-almost-DB"

Comment by Florian Sesser [ 17/Oct/12 ]

Tuner, thank you! But beware:

  • This code is demo quality (i.e., do not use it). I merely wrote it to show that it cannot be that hard.
  • If enough people vote the issue high enough, 10gen will prio it higher and solve it sooner. This is democracy, it seems: But not enough people care.
  • Alas, implementing this in MongoDB should solve multiple issues (at least SERVER-1920 and SERVER-90) and also help with others (like SERVER-380), maybe the votes for these tickets should get aggregated. 10gen, use your new aggregation framework if you like

Still – If you need to work with Unicode strings, either use MongoDB as a dumb data store and put the intelligence (i.e., libICU) into your application or use another product (like Apache Solr) to get the jobs you need done together with MongoDB.

Comment by Tuner [ 17/Oct/12 ]

Florian, this is great! If it's that simple why 10gen is that stubborn to not to use it? Can somebody explain me this crazy situation?

Comment by Florian Sesser [ 17/Oct/12 ]

For kicks I integrated libICU into mongod here on GitHub (see example run), alas with POC quality and only for sort() and not for the generation of indices. And without a real interface (I'd set the collation using the environment).

At MongoDBMunich yesterday, Mathias Stearn/10gen recommended me to rather use libICU in my application, and then save libICU-generated sort keys to the DB. Then, generate an index and sort using that.

I'll go with that strategy. Maybe the hint here will help others.

Comment by liugen [ 19/Sep/12 ]

I need this feature too.
Hope 10gen can do it quickly.

Comment by liugen [ 19/Sep/12 ]

I need this feature too.
Hope 10gen can do it quickly.

Comment by Tuner [ 01/Apr/12 ]

I hope this will be available, in other way search on string fields dosen't make much sense. When using it in BI, it's something needed by customers. How hard is it to implement?

Comment by John Crenshaw [ 04/Dec/10 ]

The best bet is probably to use the ICU library, which fully implements the official Unicode Collation Algorithm.

http://userguide.icu-project.org/packaging
http://userguide.icu-project.org/icufaq

ICU is provided under a simple GPL compatible BSD like license, so I don't think you'll have any problems there.

If you add this as an index option, ICU supports sort key generation (I.E. you can create a single value which represents the string to be compared) which could be helpful.

The best bet is probably to allow a default at the database level, with optional field level overrides (similar to how MySQL handles this.)

Will mongos also need access to collation data? (to do the final mergesort on queries that hit several shards?)

Generated at Thu Feb 08 02:58:27 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.