[SERVER-20432] $regex prefix search with escaped "|" should use tighter index bounds Created: 16/Sep/15  Updated: 16/May/18  Resolved: 26/Oct/17

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 3.0.5, 3.0.6
Fix Version/s: 3.6.0-rc2

Type: Bug Priority: Critical - P2
Reporter: ma6174 Assignee: Kyle Suarez
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-29883 "|" in query string occur into a full... Closed
Related
related to SERVER-26536 Regex containing an escaped | pipe sy... Closed
related to SERVER-16622 RegEx query predicates using the | (v... Backlog
is related to SERVER-15235 Regex query returns incorrect results... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Initialization

> db.version()
3.0.6
> db.test_search.drop()
true
> for (var i=1;i<=100;i++){db.test_search.save({_id:"id"+i})}
WriteResult({ "nMatched" : 0, "nUpserted" : 1, "nModified" : 0, "_id" : "id100" })
> db.test_search.find().limit(3)
{ "_id" : "id1" }
{ "_id" : "id2" }
{ "_id" : "id3" }
> db.test_search.find().count()
100

search and view mongodb log(run with verbose)

1. search with |

> db.test_search.find({_id:/^id\|/})
> db.test_search.find({_id:/^id\|sssssssss/})
> db.test_search.find({_id:{"$regex":"^id\\|"}})
> db.test_search.find({_id:{"$regex":"^id\\|sssssssss"}})

mongodb v3.0.6 log show that "nscanned:100"

2015-09-16T15:10:21.551+0800 I QUERY    [conn2128] query test.test_search query: { _id: /^id\|/ } planSummary: IXSCAN { _id: 1 } ntoreturn:0 ntoskip:0 nscanned:100 nscannedObjects:0 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 2 } }, MMAPV1Journal: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { R: 1 } } } 0ms
 
2015-09-16T15:10:55.456+0800 I QUERY    [conn2128] query test.test_search query: { _id: /^id\|sssssssss/ } planSummary: IXSCAN { _id: 1 } ntoreturn:0 ntoskip:0 nscanned:100 nscannedObjects:0 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 2 } }, MMAPV1Journal: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { R: 1 } } } 0ms
 
2015-09-16T15:11:16.113+0800 I QUERY    [conn2128] query test.test_search query: { _id: { $regex: "^id\|" } } planSummary: IXSCAN { _id: 1 } ntoreturn:0 ntoskip:0 nscanned:100 nscannedObjects:0 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 2 } }, MMAPV1Journal: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { R: 1 } } } 0ms
 
2015-09-16T15:11:34.034+0800 I QUERY    [conn2128] query test.test_search query: { _id: { $regex: "^id\|sssssssss" } } planSummary: IXSCAN { _id: 1 } ntoreturn:0 ntoskip:0 nscanned:100 nscannedObjects:0 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 2 } }, MMAPV1Journal: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { R: 1 } } } 0ms

we also tested in v2.4.7, same query showed "nscanned:0"

Wed Sep 16 15:43:40.854 [conn3077] query test.test_search query: { _id: /^id\|/ } ntoreturn:0 ntoskip:0 nscanned:0 keyUpdates:0 locks(micros) r:3882 nreturned:0 reslen:20 3ms
 
Wed Sep 16 15:43:57.683 [conn3077] query test.test_search query: { _id: /^id\|sssssssss/ } ntoreturn:0 ntoskip:0 nscanned:0 keyUpdates:0 locks(micros) r:115 nreturned:0 reslen:20 0ms
 
Wed Sep 16 15:44:06.890 [conn3077] query test.test_search query: { _id: { $regex: "^id\|" } } ntoreturn:0 ntoskip:0 nscanned:0 keyUpdates:0 locks(micros) r:168 nreturned:0 reslen:20 0ms
 
Wed Sep 16 15:44:13.571 [conn3077] query test.test_search query: { _id: { $regex: "^id\|sssssssss" } } ntoreturn:0 ntoskip:0 nscanned:0 keyUpdates:0 locks(micros) r:121 nreturned:0 reslen:20 0ms

2. search without |

> db.test_search.find({_id:/^idA/})
> db.test_search.find({_id:/^id\?/})

mongodb v3.0.6 show that "nscanned:0", this is expected

2015-09-16T15:16:25.091+0800 I QUERY    [conn2128] query test.test_search query: { _id: /^idA/ } planSummary: IXSCAN { _id: 1 } ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:0 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 2 } }, MMAPV1Journal: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { R: 1 } } } 1ms
 
2015-09-16T15:16:51.971+0800 I QUERY    [conn2128] query test.test_search query: { _id: /^id\?/ } planSummary: IXSCAN { _id: 1 } ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:0 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 2 } }, MMAPV1Journal: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { R: 1 } } } 0ms

mongodb explain info

> db.test_search.find({_id:/^id\|/}).explain()
{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "test.test_search",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"_id" : /^id\|/
		},
		"winningPlan" : {
			"stage" : "FETCH",
			"inputStage" : {
				"stage" : "IXSCAN",
				"filter" : {
					"_id" : /^id\|/
				},
				"keyPattern" : {
					"_id" : 1
				},
				"indexName" : "_id_",
				"isMultiKey" : false,
				"direction" : "forward",
				"indexBounds" : {
					"_id" : [
						"[\"\", {})",
						"[/^id\\|/, /^id\\|/]"
					]
				}
			}
		},
		"rejectedPlans" : [ ]
	},
	"serverInfo" : {
		"host" : "test.local",
		"port" : 27017,
		"version" : "3.0.6",
		"gitVersion" : "nogitversion"
	},
	"ok" : 1
}
>
 
> db.test_search.find({_id:/^idA/}).explain()
{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "test.test_search",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"_id" : /^idA/
		},
		"winningPlan" : {
			"stage" : "FETCH",
			"inputStage" : {
				"stage" : "IXSCAN",
				"keyPattern" : {
					"_id" : 1
				},
				"indexName" : "_id_",
				"isMultiKey" : false,
				"direction" : "forward",
				"indexBounds" : {
					"_id" : [
						"[\"idA\", \"idB\")",
						"[/^idA/, /^idA/]"
					]
				}
			}
		},
		"rejectedPlans" : [ ]
	},
	"serverInfo" : {
		"host" : "test.local",
		"port" : 27017,
		"version" : "3.0.6",
		"gitVersion" : "nogitversion"
	},
	"ok" : 1
}
>

Sprint: Query 2017-10-23, Query 2017-11-13
Participants:
Case:

 Description   

we found an slow log, it takes 271 seconds, after check, we found that the "nscanned:107090777" is vary large. we expect that the query is an prefix search, should use index, and match only several documents.

2015-09-14T10:57:27.819+0800 I QUERY    [conn17228] query fakedb.fakecoll query: { $query: { _id: { $regex: "^fake\|other_query_string" } }, $orderby: { _id: 1 } } planSummary: IXSCAN { _id: 1 } ntoreturn:200 ntoskip:0 nscanned:107090777 nscannedObjects:0 ke
yUpdates:0 writeConflicts:0 numYields:836649 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 1673300 }, acquireWaitCount: { r: 99240 }, timeAcquiringMicros: { r: 39751354
 } }, Database: { acquireCount: { r: 836650 } }, Collection: { acquireCount: { r: 836650 } } } 271897ms



 Comments   
Comment by Kyle Suarez [ 26/Oct/17 ]

Hey all,

With the commit above, regular expressions containing certain sequences of escaped | characters will now be treated as non-special and are eligible to use tight index bounds. This doesn't include every conceivable way you can have a literal pipe character in a regular expression; for example, a pipe in a character class or escaped with the \Q...\E escape sequence may still be mistakenly treated as a special regex. A full fix that works in all circumstances would depend on SERVER-16622.

Hopefully, this fix improves performance for some common use cases. You can expect it to be included in the coming MongoDB 3.6 release.

Best,
Kyle

Comment by Githook User [ 26/Oct/17 ]

Author:

{'email': 'kyle.suarez@mongodb.com', 'name': 'Kyle Suarez', 'username': 'ksuarz'}

Message: SERVER-20432 allow some escaped | chars in regexes to use tight index bounds
Branch: master
https://github.com/mongodb/mongo/commit/260fd0c76599520d9d733874753a94a2db763538

Comment by Oleg Rekutin [ 29/Aug/17 ]

Thank you Ian!

Comment by Ian Whalen (Inactive) [ 25/Aug/17 ]

Hey Oleg thanks for following up for your thoughts. Dave and the query team have talked and agreed. We'll see if we can get this in for the upcoming 3.6 release.

Comment by Oleg Rekutin [ 28/Apr/17 ]

More detail. Specifically this line is wrong:

https://github.com/mongodb/mongo/blob/r3.2.12/src/mongo/db/query/index_bounds_builder.cpp#L71

 if (StringData(regex).find('|') != std::string::npos)

Cleary, that code ignores escaped | characters!

These two regular expressions are different:
A. /^(a|b)/
B. /^a|b/

Expression A is NOT a prefix query. Expression B is 100% a prefix query, with nothing optional in it. However, due to the code referenced above and this bug, it still fails the "simpleRegex" check.

Comment by Oleg Rekutin [ 28/Apr/17 ]

@david.storch I think you are missing the point... the point is that an escaped pipe character should NOT at all be treated in any special way. I don't think it is at all valid to close this bug as a duplicate SERVER-16622.

This bug should be reopened.

SERVER-16622 describes smart handling of an actual | character, which IS NOT the case here. Merely using a pipe character in the text makes it impossible to do prefixed index scans.

Comment by David Storch [ 05/Oct/15 ]

Hi ma6174,

After reviewing this ticket, the engineering team responsible for the "Querying" component has decided to consider this a duplicate of SERVER-16622. Fixing the backslash-escaped "|" character case makes sense to do as part of the larger ticket, as this will require parsing the regular expression and analyzing the parse tree. From an engineering perspective, we would much rather use proper regex parsing than introduce a hack that special cases the string "|".

There is a lot more context on SERVER-16622, as my colleague Stephen pointed out in an earlier response. Please watch SERVER-16622 for progress updates.

Best,
Dave

Comment by ma6174 [ 16/Sep/15 ]

Thanks.

Comment by Stennie Steneker (Inactive) [ 16/Sep/15 ]

Hi ma6174,

We appreciate all bug reports and product suggestions. This issue will be triaged by the engineering team and considered in the next round of planning. You can upvote and watch this issue for updates.

Thanks,
Stephen

Comment by ma6174 [ 16/Sep/15 ]

Hi,

Since we had escaped the character "|" to "\ |" , this should be regarded as a normal character, not regular expression "OR", this should use index to search.

If this not be fixed, and some user use this to search, if the collection very large, the whole datebase maybe become very slow or unavailable.

Will you consider to fix this?

Regards,
ma6174

Comment by Stennie Steneker (Inactive) [ 16/Sep/15 ]

Hi,

Thanks for reporting this issue and including steps to reproduce.

It looks like this is a consequence of the changes for SERVER-15235, which consider any regular expression containing "|" as requiring an index scan for correctness reasons (also see the discussion on SERVER-16622).

In your particular case the "|" is escaped but unfortunately this is still matching as a "non-simple" regex, which results in a full index scan rather than bounds limited to the regex prefix.

Regards,
Stephen

Generated at Thu Feb 08 03:54:12 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.