[SERVER-29865] Make regex use collation index Created: 26/Jun/17  Updated: 27/Oct/23  Resolved: 30/Jun/17

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: dane truelson Assignee: Backlog - Query Team (Inactive)
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Query
Participants:

 Description   

Case insensitive indexes should be used when running a case insensitive "starts with" regex like this /^searchterm/i

Currently if you make a case insensitive collation and run a "starts with" regex query like this /^searchterm/i, mongodb will not use the index. This is a real problem for any application that uses an autocomplete feature.



 Comments   
Comment by Tess Avitabile (Inactive) [ 23/Aug/18 ]

Yes, please open a separate ticket.

Comment by David Mumladze [ 23/Aug/18 ]

Thanks for clarifying.

That makes sense as collation order needs to be considered, which means this approach wouldn't be suitable for collections with data in multiple languages, so I'm abandoning this approach for now.

I'm also trying another option, where all names are transformed to lowercase and pushed to an array, which is searched using the regex prefix: 

db.people.find({ search: /^souz/ })

Some observations

  • Sorting does not use keys defined in the index (range based or not).
  • Regex is quick on collections without a default collation, but otherwise very slow, although in both cases, it's using the index scan.

More on sorting.. Index is defined as: 

{search: 1, score: -1, lname: 1}

and my query is:

db.people.find({ search: /^souz/ }).sort({ score: -1, lname: 1 })

.Sorting also didn't work with range query. Btw, if I create an auxiliary index as:

{ score: -1, lname: 1 }

then that index is used and results in full scan.

Should I open a separate case?  

Comment by Tess Avitabile (Inactive) [ 23/Aug/18 ]

Hi David,

When the prefix ends with 'z', you can "increment" previous letter of the prefix in the upper bound, for example:

db.people.find({$and: [{a: {$gte: "souz"}}, {a: {$lt: "sov"}}]})

If the prefix is all 'z's, then you can leave off the upper bound:

db.people.find({a: {$gte: "zzz"}})

This assumes that all of the characters in the names are alphabetical.

I am not certain why you are not observing "souza" >= "souz" && "souza" < "souª" in MongoDB, but I would guess it is because 'ª' does not follow 'z' in the collation that you are using.

Best,
Tess
 

Comment by David Mumladze [ 23/Aug/18 ]

Hi Tess,

Could you please explain how range approach may work for below scenario?

There are last names that have 'z' in them, so if the user is searching for 'souz' prefix, then the next higher character after 'z' would be '{', which is not working, so I tried a char regarded as letter 'ª', but it still didn't work.

Query came out as 

db.contacts.find({"lname":{ $gte:"souz", $lt:"souª"}})

Also, run this in any program and it'll return true, but I'm puzzled as to why MongoDB cannot do it. 

"souza" >= "souz" && "souza" < "souª"

Collection collation is

{ locale: "en", strength:2 } 

with corresponding index on lname field.

Thanks,
David

Comment by dane truelson [ 01/Jul/17 ]

It works! Thank you! Here is my code to add the proper search criteria to the find object in case anybody else wants it. I just made it today, so it is not guaranteed to be bug free.

function getEndStr(str) {
	var endStrArr = str.toLocaleLowerCase('en-US').split("");
	for (var i = endStrArr.length - 1; i >= 0; --i) {
		var lastChar = endStrArr[i];
		var nextChar  = String.fromCharCode(lastChar.charCodeAt(0) + 1);
		if(nextChar === ":")
			nextChar = "a";
		console.log("NEXT ", nextChar);
		if (nextChar !== false) {
			endStrArr[i] = nextChar;
			return endStrArr.join("");
		}
		endStrArr.pop();
	}
}
global.addStartsWithQuery = function (searchCriteria, propertyName, str) {
	if (!(typeof str === 'string') || !str.length)
		return;
	var endStr = getEndStr(str);
	if (endStr) {
		if (!searchCriteria.$and)
			searchCriteria.$and = [];
		searchCriteria.$and.push({
			[propertyName]: {
				$gte: str
			}
		});
		searchCriteria.$and.push({
			[propertyName]: {
				$lt: endStr
			}
		});
	} else {
		searchCriteria[propertyName] = {
			$gte: str
		}
	}
}

Comment by Tess Avitabile (Inactive) [ 30/Jun/17 ]

Ah sorry I didn't point this out in my example. For the upper bound, you need to "increment" the last letter in the prefix. So in this query, you would use

db.people.find({$and: [{a: {$gte: "joh"}}, {a: {$lt: "joi"}}]})

Comment by dane truelson [ 30/Jun/17 ]

Hi Tess,

Thanks so much for the reply. You all have done great work with collation support! I tried the workaround you suggested, and it did not work. I think it's because the word "john" is $gte "joh", but it is NOT $lte "joh". In other words, the full word is greater than the partial word, so the $lt part won't work. I did some testing and found this:
Database object

{ firstName:"john" }

db.people.find({firstName:{$gte:"joh"}});
returns the correct document plus more

now when I run
db.people.find({firstName:{$lte:"joh"}});
it does not return the correct document. Aparently "john" is greater than "joh"

Would you please double check the workaround and let me know if there is another way?

Comment by Tess Avitabile (Inactive) [ 30/Jun/17 ]

Hello Dane,

You raise a great question. This behavior is by design. In an index that uses collation, strings are stored as their ICU collation keys for that collation. So it seems as though for the regex /^searchterm\/i on an index with case-insensitive collation, we could use index bounds [ICUCollationKey("searchterm"), ICUCollationKey("searchtern")). However, while this would work in the English collation, it will not work for all collations. For example, for locales that sort from back to front, not all strings with the same prefix are sorted together, e.g. "ab" < "cb" < "ac". In general, the locale is allowed to define an arbitrary sort order over strings, so we cannot make assumptions as to how these sort orders relate to the binary comparison expected from regexes.

As a workaround, you can write "starts with" regex queries using $gt an $lt, which can use a collation, i.e.

db.c.find({$and: [{a: {$gte: "searchterm"}}, {a: {$lt: "searchtern"}}]}).collation({locale: "en", strength: 2})

If the collation of the operation matches the collation of the index, it will be able to use that index.

Best,
Tess

Generated at Thu Feb 08 04:22:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.