Hi,
It seems that a find operation using a character class regex results in a full index scan, even when the regex is anchored.
When the field is an array, this can cause a large performance hit, because each document is accessed once for every indexed array element.
Note that nscannedObjects is 8 in the following example, although the collection holds only 5 documents:
db.foo.save({ "_id" : 1, "keywords" : [ "a" ] })
db.foo.save({ "_id" : 2, "keywords" : [ "b" ] })
db.foo.save({ "_id" : 3, "keywords" : [ "c" ] })
db.foo.save({ "_id" : 4, "keywords" : [ "a", "b" ] })
db.foo.save({ "_id" : 5, "keywords" : [ "a", "b", "c" ] })
db.foo.ensureIndex({ keywords:1 })
> db.foo.find({ keywords:/^[bc]/ }).explain()
{
    "cursor" : "BtreeCursor keywords_1 multi",
    "isMultiKey" : true,
    "n" : 4,
    "nscannedObjects" : 8,
    "nscanned" : 8,
    "nscannedObjectsAllPlans" : 8,
    "nscannedAllPlans" : 8,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
        "keywords" : [
            [
                "",
                {
                }
            ],
            [
                /^[bc]/,
                /^[bc]/
            ]
        ]
    },
    "server" : "Jeffs-MacBook-Air.local:27017"
}
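To make the numbers above easier to check, here is a rough sketch in plain JavaScript that counts index entries for the five sample documents. It assumes a multikey index holds one entry per array element; the variable names are made up for illustration and nothing here calls the server.

```javascript
// The five sample documents from the report.
var docs = [
  { _id: 1, keywords: ["a"] },
  { _id: 2, keywords: ["b"] },
  { _id: 3, keywords: ["c"] },
  { _id: 4, keywords: ["a", "b"] },
  { _id: 5, keywords: ["a", "b", "c"] }
];

// Assumption: a multikey index stores one entry per array element.
var entries = [];
docs.forEach(function (doc) {
  doc.keywords.forEach(function (key) {
    entries.push({ key: key, id: doc._id });
  });
});

// Entries visited when the bounds cover every string ("" to {}).
var fullScan = entries.length;

// Entries whose key actually starts with "b" or "c".
var matching = entries.filter(function (e) {
  return /^[bc]/.test(e.key);
});
var prefixScan = matching.length;

// Distinct documents among the matching entries.
var seen = {};
matching.forEach(function (e) { seen[e.id] = true; });
var matchedDocs = Object.keys(seen).length;
```

Under that assumption, fullScan corresponds to nscanned in the output above, and matchedDocs corresponds to n.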
The workaround seems to be to specify each element of the character class individually, as a separate anchored regex inside $in:
> db.foo.find({ keywords:{ $in:[ /^b/, /^c/ ] }}).explain()
{
    "cursor" : "BtreeCursor keywords_1 multi",
    "isMultiKey" : true,
    "n" : 4,
    "nscannedObjects" : 5,
    "nscanned" : 5,
    "nscannedObjectsAllPlans" : 5,
    "nscannedAllPlans" : 5,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
        "keywords" : [
            [
                "b",
                "d"
            ],
            [
                /^b/,
                /^b/
            ],
            [
                /^c/,
                /^c/
            ]
        ]
    },
    "server" : "Jeffs-MacBook-Air.local:27017"
}
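For longer character classes, writing out the $in list by hand gets tedious. A minimal sketch of a helper that builds it is below; anchoredPrefixIn is a hypothetical name, not a MongoDB function, and it assumes the class is a plain list of literal characters (no ranges such as a-z, no negation).

```javascript
// Hypothetical helper, not part of MongoDB: builds the $in workaround
// from a string of literal leading characters.
function anchoredPrefixIn(chars) {
  // Escape regex metacharacters so each character is matched literally.
  function escapeChar(ch) {
    return ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  }
  return {
    $in: chars.split("").map(function (ch) {
      return new RegExp("^" + escapeChar(ch));
    })
  };
}

// Builds the same filter as { keywords: { $in: [ /^b/, /^c/ ] } }.
var query = { keywords: anchoredPrefixIn("bc") };
```

In the shell this would be used as db.foo.find(query).explain(). Whether the tighter index bounds also apply to escaped characters has not been checked here.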
Is duplicated by: SERVER-22722 Ranged regex uses inefficient indexBounds (Closed)