[SERVER-9938] Inefficient index boundary selected when using character class regex Created: 14/Jun/13  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Querying
Affects Version/s: 2.4.4
Fix Version/s: None

Type: Improvement Priority: Minor - P4
Reporter: Jeff lee Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 3
Labels: storch
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-22722 Ranged regex uses inefficient indexBo... Closed
Assigned Teams:
Query Optimization
Participants:
Case:

 Description   

Hi,

It seems that when you use a character class regex in a find operation, it results in a full index scan even when the character class is anchored.

When the field is an array, it can result in a huge performance hit as each document is accessed multiple times for each indexed array element.

Note how we have 8 scanned objects in the following example:

db.foo.save({ "_id" : 1, "keywords" : [  "a" ] })
db.foo.save({ "_id" : 2, "keywords" : [  "b" ] })
db.foo.save({ "_id" : 3, "keywords" : [  "c" ] })
db.foo.save({ "_id" : 4, "keywords" : [  "a",  "b" ] })
db.foo.save({ "_id" : 5, "keywords" : [  "a",  "b",  "c" ] })
db.foo.ensureIndex({ keywords:1 })
 
> db.foo.find({ keywords:/^[bc]/ }).explain()
{
	"cursor" : "BtreeCursor keywords_1 multi",
	"isMultiKey" : true,
	"n" : 4,
	"nscannedObjects" : 8,
	"nscanned" : 8,
	"nscannedObjectsAllPlans" : 8,
	"nscannedAllPlans" : 8,
	"scanAndOrder" : false,
	"indexOnly" : false,
	"nYields" : 0,
	"nChunkSkips" : 0,
	"millis" : 0,
	"indexBounds" : {
		"keywords" : [
			[
				"",
				{
 
				}
			],
			[
				/^[bc]/,
				/^[bc]/
			]
		]
	},
	"server" : "Jeffs-MacBook-Air.local:27017"
}

The workaround seems to be to specify each element of the character class individually:

> db.foo.find({ keywords:{ $in:[ /^b/, /^c/ ] }}).explain()
{
	"cursor" : "BtreeCursor keywords_1 multi",
	"isMultiKey" : true,
	"n" : 4,
	"nscannedObjects" : 5,
	"nscanned" : 5,
	"nscannedObjectsAllPlans" : 5,
	"nscannedAllPlans" : 5,
	"scanAndOrder" : false,
	"indexOnly" : false,
	"nYields" : 0,
	"nChunkSkips" : 0,
	"millis" : 0,
	"indexBounds" : {
		"keywords" : [
			[
				"b",
				"d"
			],
			[
				/^b/,
				/^b/
			],
			[
				/^c/,
				/^c/
			]
		]
	},
	"server" : "Jeffs-MacBook-Air.local:27017"
}



 Comments   
Comment by Guy Arad [ 11/Mar/19 ]

I'm experiencing a similar issue, but in my case, the wrong index is being chosen. I can understand why bounds are being expanded until the first meta character, but can't understand why the wrong index is being selected. 

I have these two indices:

  1. app_key: 1, udid: 1
  2. app_key: 1, timestamp: 1

Running this query

{app_key: 'abcde', udid: /^2b/}

gives the expected performance (as seen with explain).

Running this query

{app_key: 'abcde', udid: /^2[Bb]/}

 results in examining all the docs with the given app_key. The selected index was #2 above. 

Running the previous query, providing a hint for index #1 gives much better results, but still the keys examined is about 10 times the docs returned because of the bounds for the udid field: [ ["2", "3"), [/^2[bB]/, /^2[bB]/] ].

Using this query:

{app_key: 'abcde', $or: [{udid: /^2B/}, {udid: /^2b/} ]}

did not help. Actually, when using a hint of index #1 with this query, the bounds for `udid` were `[MinKey, MaxKey]` and resulted in the worst performance.

Best performance achieved by not using the regex character class at all, and using `$in` instead:

{app_key: 'abcde', udid: {$in: [ /^2b/, /^2B/ ]}}

Keys examined, docs examined and docs returned were all the same value. These were the bounds: `["2B", "2C"), ["2b", "2c"), [/^2B/, /^2B/], [/^2b/, /^2b/]`

 

(note: for the life of me, I couldn't get the code parts to work nicely in the commenting framework - sorry)

Generated at Thu Feb 08 03:21:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.