[SERVER-90] case insensitive index Created: 03/Jun/09  Updated: 07/Apr/23  Resolved: 23/Aug/16

Status: Closed
Project: Core Server
Component/s: Index Maintenance
Affects Version/s: None
Fix Version/s: 3.3.11

Type: New Feature Priority: Major - P3
Reporter: Eliot Horowitz (Inactive) Assignee: Backlog - Query Team (Inactive)
Resolution: Done Votes: 359
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-1920 Sort by collation Closed
Duplicate
Related
related to SERVER-14784 Add expression indexes Backlog
related to DOCS-7503 Document the collation project (part 1) Closed
Assigned Teams:
Query
Backwards Compatibility: Fully Compatible
Participants:
Case:

 Description   
Issue Status as of Aug 23, 2016

ISSUE SUMMARY

Case-insensitive indices are available in MongoDB 3.3.11 via the new collation feature. Collations allow queries to match and sort UTF-8 string data in a locale-aware fashion, but also allow users to configure case sensitivity. For general documentation of collation, see SERVER-1920 and Development Series 3.3.x Collation.

TECHNICAL DETAILS

Case-insensitivity can be specified as a collation with a strength of 2. This strength indicates that the collator should make case-insensitive comparisons. (Strength 1, the weakest comparison level, is both case-insensitive and diacritic-insensitive. Higher strengths are more discriminating whereas lower strengths are less discriminating.) Case-insensitivity, like any collation, can be specified at the level of an operation, an index, or as a collection default. Queries or index builds with no explicit collation will inherit the collection's default collation. An index is eligible for use by case-insensitive string matching or sorting operations if its case-insensitive collation matches that of the query.

Example

The following example shows how to create and query a case insensitive index with English case folding rules:

> db.myCollection.createIndex({city: 1}, {collation: {locale: "en", strength: 2}});
> db.myCollection.insert({_id: 1, city: "New York"});
> db.myCollection.insert({_id: 2, city: "new york"});
> db.myCollection.find({city: "new york"}).collation({locale: "en", strength: 2});
{ "_id" : 1, "city" : "New York" }
{ "_id" : 2, "city" : "new york" }

IMPACT ON DOWNGRADE

Downgrade from 3.4 to 3.2 is illegal if the data files contain any collections with a case-insensitive default collation or any case-insensitive indices. Before downgrading, all collections and indices with an associated collation must be dropped.

FURTHER INFORMATION

Documentation for this feature is available in the 3.3.x development series release notes. To join our beta program for Case Insensitive Indices, and suggest improvements to our implementation, please email beta@mongodb.com.

Original description

potential syntax:

db.foo.ensureIndex( { name : 1 } , { caseInsensitive : true } )
db.foo.ensureIndex( { name : 1 } , { caseInsensitive : true , locale : "FR" } )
db.foo.ensureIndex( { name : 1 } , { caseInsensitive : true , localeKey : "user.country" } )
 
db.foo.ensureIndex( { name : 1 } , { caseInsensitive : [ "name" ] } )

reminder, you can aways do this for now:

{ name : { real : "Eliot" , sort : "eliot" } }
ensureIndex( { "name.sort" : 1 } )



 Comments   
Comment by Rahul B. Mattte [X] [ 02/Aug/16 ]

I am using below query to fetch the data in alphabetically sorted format

collection.col.find(

{name: somename}

).sort(

{name: 1}

).toArray(function(error, namelist)

{...}

)

The query returns the sorted data but in case sensitive format.

{ "name" : "AAA" }

,

{ "name" : "BBB" }

,

{ "name" : "aaa" }

Can you please suggest any solution to get the result in insensitive format like below

{ "name" : "aaa" }

,

{ "name" : "AAA" }

,

{ "name" : "BBB" }

Is this reported bug applicable for my issue also..please help me I am new to the mongoDB

Comment by Daniel Pasette (Inactive) [ 28/Jul/16 ]

Work on this feature is close to completion and scheduled to ship with 3.4. Once it's ready in a development release we will provide documentation and we'd love for you to give it a try.

Comment by Oleg Vivtash [ 28/Jul/16 ]

Is there any update on when to expect this feature?
It gets pretty messy to double sortable string fields amount when you have tens of them.

Comment by Jesper Erik Bendtsen [ 18/Mar/16 ]

Really hope this feature comes in the near future, everyone are used to search in case insensitive way, it really creates headache for developers!
Meanwhile, I started to save strings with certain style, if I know the field is like that, rather than to create two fields, one for search/sort and one for display who can make the doc huge.

Example in PHP
strToUpper () - Make a string uppercase
strtolower () - Make a string lowercase
ucfirst () - Make a string's first character uppercase
ucwords () - Uppercase the first character of each word in a string
none = Slow search

So when users save data I change it to one of the above. And when users searching I change the input to above and then do the search/sort.
If I do not know if the field has a certain style. I use REGEX like this "new MongoDB \ BSON \ Regex ('^'. $ Value. '$', 'In')" which is slow. I really waiting for MongoDB will support case insensitive search...

Know the above is not super, but maybe it will help someone until case insensitive search comes, really hopes it comes!? This ticket is from 2009, it worries me! Please reply with future plans?

Comment by Roberto Germano Vieweg Neto [ 09/Mar/16 ]

Another good feature can also be pass a compare function for the sort method, like this:

db.myEntity.find().sort(function(myEntityA, myEntityB) { 
   return myEntityA.field.toLowerCase().compareTo(myEntityB.field.toLowerCase()); 
})

Sometimes you need to sort little data, and you doesn't want/need to create an index to do that.

Comment by Tim Hardy [ 09/Feb/16 ]

For the love of all that is good, PLEASE add case insensitive indexes! Pretty much all web development involves the ability to search by string, and the standard is case insensitive.

When you type a search into Amazon, do you expect your search terms to be case sensitive? No, you don't. If you're a web developer and use MongoDb, you have to bend over backwards to provide this incredibly common capability.

How Mongo can style itself as an enterprise option nowadays and not support this is beyond me.

Comment by Damjan Rems [ 04/Feb/16 ]

I also become more and more dependend on alternative sorting.

I have been thinking about the issue and as I remember I have read somewhere that it was so important to MongoDB team to have fast sorting algorithm that they didn't want to complicate sorting process.

So lets save data in sorting indexes the way that doesn't change sorting process. My proposal is to replace charachters (binary code) in String keys just before it is saved to index key, with characters provided by collation translation tabels. This also means that find algorithm should do the same operation before index searching is done.

The other solution would be to replace characters before they are saved to document. As of MongoDB 3.2 encription is available, which does more or less the same thing.

Mongo team. Please we are waiting alternative sorting for too long.

by
Damjan Rems, http://www.drgcms.org

Comment by dane truelson [ 28/Dec/15 ]

Hello guys... This is a very important feature for almost any application that deals with non-numeric search. It does not make sense that I have to store two copies of every person's name just to get a search that is fast enough.

Comment by Josef Sábl [ 30/Sep/15 ]

It is very frustrating that this obvious thing is still unresolved. We are actually dumping MongoDB for Elastic Search for our search needs. Congrats

Comment by chuong nguyen [ 28/Jul/15 ]

Please help me sort with char uppercase and lowercase

Comment by Tuner [ 09/Sep/14 ]

Hope it helps guys, http://derickrethans.nl/mongodb-collation.html

Comment by Laurentiu Macovei [ 09/Sep/14 ]

This is a must have in terms of easing very common basic usage. Too bad in 4 years nothing changed

Comment by John Page [ 20/Aug/14 ]

I believe are a couple of workarounds to this one of which is to do with the correct pay to page (i.e. never using $skip as many seem to).

For a simple case insensitive search create a full text search index on the field, when querying query on the full text search index AND a case insensitive regex on the field to ensure no false positives.

When paging - the important thing is to query every time for values greater than the last on the page of results - limit X where X is the page size. Sort and skip has you walking through the results or the index taking longer for each page - search > last value and sort is better (assuming an index to ensure you get them in order)

If you use a case Sensitive index for this you do two searches limit pagesize with the first letter changed case then sort the results on display - so if the last thing on the page was

'john.page@mongodb.com'

Query for >= 'john.page@mongodb.com' limit 10
And in parallel >= 'John.page@mongoDB.com'

Then sort those results and show the first 10.

Not perfect but still quite fast.

You can only do this to page through results from the start - if you want to jump to a specific page it's a lot harder, but if you are indexing on other fields as well possibly not that hard.

Comment by Hendy Irawan [ 05/Jul/14 ]

+1. tshawkins now that you mentioned it, yes we need SERVER-153.

but IMHO this one is more priority due to extremely common usage.

Comment by Ivan Fioravanti [ 11/Jun/14 ]

Again on this... the fact of not being able to do a case insensitive sort through an index is really becoming a blocking issue in some scenario, especially in big data world.

We've already faced 2 situations where we need to perform sort and paging on quite large datasets (3M items) in order to display data in a web app.
The fact of using an additional column with "toLower" representation isn't really feasible because sort can be performed on many different fields.

I imagine this is a nightmare from implementation point of view, but it's a really important feature when you've to deal with large datasets.
Can you please give us an update on this topic? In order for us to decide if we have to think about a temporary workaround for our customers?

Thanks

Comment by Ivan Fioravanti [ 22/May/14 ]

I love MongoDB too and I'm pushing its usage everywhere in our projects and for our customers. I'm receiving many requests related to Case Insensitive search and I'd really like to have this option built-in the product. For some customers we have used Elasticsearch for this and other usage, but it's too much for standard use.

+1

Thanks.

Comment by Paco Hernández [ 05/May/14 ]

I am an advocate of MongoDB, I love the product, but I can not understand how something so common & so critical has not been addressed in over 4 years.

The solution to duplicate columns doubles the space in disk, besides having to apply additional projections to prevent these fields are returned, and so on. Not an elegant solution, but it is the only one we have.

Is there any hope that something will be implemented in the next version of Mongo?

+1 To the creation of a special index.

Thanks.

Comment by Ovidiu Anicai [ 03/Dec/13 ]

You can duplicate the column with all lowercase (maybe also replace characters with accents or special chars) then you do the search with a lowercase string.
If you do this you'll have to keep the consistency on edits

Comment by Tim Hawkins [ 26/Mar/13 ]

@arkady
Unicode folding is required anyway for proper multilingual text search.

Comment by Joel Sanderson [ 15/Feb/13 ]

@Clifford - I also like your concept of using a function to compute the index value. I'd love to see this functionality in MongoDB!

Comment by Phil Idem [ 08/Oct/12 ]

If this feature and Full-Text Searching (SERVER-380) were available then I could see how MongoDB could replace other search technologies such as SOLR, Lucene, and Sphinx.

@Clifford, good suggestion. However, I think the query engine might need a special syntax when using find so that caller can clearly state that they are doing a case-insensitive search. I doubt you want to prevent the ability to do a case-sensitive search just because you have a case-insensitive index.

Postgres has a similar feature:
http://www.postgresql.org/docs/9.2/static/indexes-expressional.html

Similar to Postgres, I think it should be possible to use the expression without the index (the index should be used to improve performance).

Maybe something like this could be used:

// register a server-side function named "lower" that can be used to convert a field to lower case
db.system.js.save({
    _id : "lower",
    value : function(fieldName) {
        return this[fieldName].toLowerCase();
    }});
 
// index the lower case value of firstName
db.users.ensureIndex( { "lower('firstName')": 1 } )
 
// find all users whose lower case first name starts with "john" (would match "Johnny", "John", "john", etc.)
db.users.find({
    "lower('firstName')" : {
        $regex : /^john/
    }
});

Comment by Arkadiy Kukarkin [ 19/Sep/12 ]

The lowercase index suggestion (as well as the current lowercase field workaround) don't really work outside ascii. Given that

Reichwaldstraße → reichwaldstrasse
REICHWALDSTRASSE → reichwaldstrasse
όσος → ΌΣΟΣ
ΌΣΟΣ → όσοσ

should REICHWALDSTRASSE match documents with Reichwaldstraße or reichwaldstrasse? ΌΣΟΣ to όσος or όσοσ? etc. You really need proper unicode case folding for these.

That being said, very very needed feature, maybe even in a degraded ascii-only version.

Comment by Eliot Horowitz (Inactive) [ 19/Jul/12 ]

I think indexes on expressionsa re definitely something we want to do.
Thought a case insensitive index is a bit different than toLower because of sorting in different locales, etc...

Comment by Idris Mokhtarzada [ 19/Jul/12 ]

+1 for Clifford's computed idea

Comment by Clifford Hammerschmidt [ 19/Jul/12 ]

Can we consider making this more general.
e.g.:

db.users.ensure_index({name:{order: 1, computed: function() { return this.name.toLowerCase(); }});
db.users.add(

{name: "Bob"}

);
db.users.find(

{name: "BOB"}

); -> returns {_id:ObjectId(...), name: "Bob"}

So the index on name, and queries against name, are passed through the "computed" logic.

There are issues with using a surrogate field and placing this logic to populate the surrogate field in the applications accessing the db rather than in the indexing logic in the db:
• applications (and their instances) can differ in their logic (e.g. bugs, version drift, multiple consuming applications, etc.)
• the intent of the code is to make the index case insensitive, if the intent changes we'd have to change all the applications (and their installs) accessing the data rather than changing the indexing function in the db once.
• the surrogate field populated by the application would be exposed (e.g. you'd have to add something like "name_lowercase".)

Of course having:

db.users.ensure_index({name:{order:1, index_type: case_insensitive}});

Where the index_ype is used to implement the index logic in C rather than javascript would be much faster. If you're open to that I'd suggest having javascript computation as a type, e.g.:

db.users.ensure_index({name:{order:1, index_type: javascript, javascript: function()

{ return this.name.toLowerCase();}

}});

Or even an FFI version that loads the indexing logic from a plugin:

db.users.ensure_index({name:{order:1, index_type: ffi_cdecl, ffi_lib:"custom.so"}});
extern "C" BSONObj Computed(BSONObj obj)

{ return ... }

// C signature might be wrong, it's been a while since I've done C...

Comment by jonathan roeder [ 15/May/12 ]

Is it agreed that this shouldn't just be an index feature? Lookup via index or scan should behave consistently.

Comment by Eliot Horowitz (Inactive) [ 10/Apr/12 ]

You can look at "fix for" above to see when it is scheduled.

Comment by Sonal Shah [ 06/Apr/12 ]

Any Idea when can it available?

Thanks,
Sonal Shah

Comment by Tuner [ 01/Apr/12 ]

+1 Major feature, and extremely important in every non-ascii database...

Comment by Sonal Shah [ 29/Mar/12 ]

Any updates on this feature ? When can be it available. I need it badly.

Thanks,
Sonal Shah

Comment by Eliot Horowitz (Inactive) [ 25/Mar/12 ]

If you use utf8_unicode_ci in mysql, I believe it just ignores accents, etc...

Or you can choose a specific language collation to get correct sorting.

We could do that, I think the discussion is around whether there is a "better" solution.

Likely not...

Comment by Robert La Ferla [ 25/Mar/12 ]

How do other databases like Oracle/MySQL handle this? MongoDB should be no different.

Comment by Colin Mollenhour [ 14/Dec/11 ]

If a field had a case-insensitive index, would queries match case-insensitively only when using the index? If so that could be very confusing.. What about "ci" operators like .find({"foo":{$ciEq:"bar"}}) and .sort(

{"foo":2}

) and .sort(

{"foo":-2}

). Sorts and queries that don't specify "ci" operators or +/-2 would either not use the index or use the index only if it is case-sensitive (same behaviour as current). "ci" queries/sorts would use an index if available, but if not available would still do the correct matching/sorting (would be very slow of course).

Comment by Remon van Vliet [ 16/Mar/11 ]

I don't think localization is directly related to indexes and I doubt it should be part of ensureIndex syntax.

Also, is there a specific spec available on how this will work internally? For example, will turning on the case insensitivity flag simply mean lowercase versions of strings are stored in the index for case insensitive sorts or will the actual index value be replaced with the lowercase version in which case query behaviour is affected. In other words, with this index flag on will find(

{a:"Hi"}

) and find(

{a:"hi"}

) return the same document?

Comment by Idris Mokhtarzada [ 15/Mar/11 ]

If you read the comments or the end of the description, you'll see there's a workaround that will allow case-insensitive sorting, so no, it's not required. However, that workaround is obviously less than ideal.

Comment by Doug Mayer [ 15/Mar/11 ]

Has anyone looked at this in the last 8 months? In order to do case-insensitive sorting on the MongoDB server side, is this not effectively required?

Comment by Eliot Horowitz (Inactive) [ 27/Jul/10 ]

For short term - I would just add a 2nd field that you call .toLower() on before inserting.
Then you can sort on that.

Comment by Sean Wolfe [ 27/Jul/10 ]

What timeframe is expected to add this feature? Users are complaining about how our sorts are case sensitive. Is there an alternate way of sorting records on a key in a case-insensitive fashion?

Comment by Eliot Horowitz (Inactive) [ 02/Apr/10 ]

trying to keep 1.5/1.6 very focused on sharding + replica sets.
will try to get it out asap so we can go back to features like these

Generated at Thu Feb 08 02:52:59 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.