[SERVER-15696] $regex, $in and $sort with index returns too many results Created: 16/Oct/14  Updated: 11/Jul/16  Resolved: 04/Nov/14

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 2.6.4, 2.6.5, 2.7.7
Fix Version/s: 2.6.6, 2.8.0-rc0

Type: Bug Priority: Major - P3
Reporter: Simon Lavigne-Giroux Assignee: David Storch
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Backport Completed:
Steps To Reproduce:

My index is :

{
                "v" : 1,
                "key" : {
                        "channelId" : 1,
                        "searchField" : 1
                },
                "name" : "channelId_1_searchField_1",
                "ns" : "search.searchSeriesOnChannel1"
}

Here's a working query hinting not to use the index which returns 2 results :

db.searchSeriesOnChannel1.find({$and : [{ "searchField" : { "$regex" : "_dr" , "$options" : "i"}, "channelId" : { "$in" : [ "CTVTo" , "CTVHT"]}}]}).sort({searchField:1}).hint({$natural:1}).explain()
{
        "cursor" : "BasicCursor",
        "isMultiKey" : false,
        "n" : 2,
        "nscannedObjects" : 21743,
        "nscanned" : 21743,
        "nscannedObjectsAllPlans" : 21743,
        "nscannedAllPlans" : 21743,
        "scanAndOrder" : true,
        "indexOnly" : false,
        "nYields" : 169,
        "nChunkSkips" : 0,
        "millis" : 59,
        "server" : "fonsemongo:27017",
        "filterSet" : false
}

Using the index, it returns 62 elements because it discards the searchField $regex filter.

db.searchSeriesOnChannel1.find({$and : [{ "searchField" : { "$regex" : "_dr" , "$options" : "i"}, "channelId" : { "$in" : [ "CTVTo" , "CTVHT"]}}]}).sort({searchField:1}).explain()
{
        "clauses" : [
                {
                        "cursor" : "BtreeCursor channelId_1_searchField_1",
                        "isMultiKey" : false,
                        "n" : 62,
                        "nscannedObjects" : 62,
                        "nscanned" : 62,
                        "scanAndOrder" : false,
                        "indexOnly" : false,
                        "nChunkSkips" : 0,
                        "indexBounds" : {
                                "channelId" : [
                                        [
                                                "CTVHT",
                                                "CTVHT"
                                        ]
                                ],
                                "searchField" : [
                                        [
                                                "",
                                                {
 
                                                }
                                        ],
                                        [
                                                /_dr/i,
                                                /_dr/i
                                        ]
                                ]
                        }
                },
                {
                        "cursor" : "BtreeCursor channelId_1_searchField_1",
                        "isMultiKey" : false,
                        "n" : 0,
                        "nscannedObjects" : 0,
                        "nscanned" : 0,
                        "scanAndOrder" : false,
                        "indexOnly" : false,
                        "nChunkSkips" : 0,
                        "indexBounds" : {
                                "channelId" : [
                                        [
                                                "CTVTo",
                                                "CTVTo"
                                        ]
                                ],
                                "searchField" : [
                                        [
                                                "",
                                                {
 
                                                }
                                        ],
                                        [
                                                /_dr/i,
                                                /_dr/i
                                        ]
                                ]
                        }
                }
        ],
        "cursor" : "QueryOptimizerCursor",
        "n" : 62,
        "nscannedObjects" : 62,
        "nscanned" : 62,
        "nscannedObjectsAllPlans" : 62,
        "nscannedAllPlans" : 62,
        "scanAndOrder" : false,
        "nYields" : 1,
        "nChunkSkips" : 0,
        "millis" : 0,
        "server" : "fonsemongo:27017",
        "filterSet" : false
}

Participants:

 Description   

The combination of a $regex, $in and $sort using an index returns too many results. The sort is done on the field filtered with a $regex and that filter is completely discarded. We get 62 results with the index and 2 results without it.



 Comments   
Comment by Githook User [ 18/Nov/14 ]

Author:

{u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}

Message: SERVER-15696 don't drop filters while creating a SORT_MERGE plan

(cherry picked from commit dcb68b25fe7351888385cc435f339ada8e8b5e9e)
Branch: v2.6
https://github.com/mongodb/mongo/commit/ffab5c390b697d0c6e94d9ec160dade57aa07246

Comment by Githook User [ 04/Nov/14 ]

Author:

{u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}

Message: SERVER-15696 don't drop filters while creating a SORT_MERGE plan
Branch: master
https://github.com/mongodb/mongo/commit/dcb68b25fe7351888385cc435f339ada8e8b5e9e

Comment by Asya Kamsky [ 20/Oct/14 ]

nicou50

I mean that we would need to repeat regex query X times instead of writing it down only once.

It's a single query with multiple $or clauses, so you only run a single query still.

Our current work around is to copy "searchField" into another field called "sortField".

See my alternate working solution if you can avoid case insensitive options to $regex then wrapping regex _dr and regex _DR in an $or also works. That doesn't require creating a second field with identical value.

Asya

Comment by Nicolas [ 20/Oct/14 ]

I mean that we would need to repeat regex query X times instead of writing it down only once.
Our current work around is to copy "searchField" into another field called "sortField".
We are searching on searchField and sorting on sortField even if they have the same content.
It is an ugly solution, but it's working. Waiting for your fix.
Thanks

Comment by Asya Kamsky [ 20/Oct/14 ]

Yeah, but the problem is that we can have a LOT of channelIds. It would make the query REALLY big for nothing

I'm not sure what you mean by "really big" and "for nothing" is relative. Your other options would be to replace the index with one on just channelId and have mongod do in memory sort (if result set expected is not huge that might be okay).

Another work-around (if you only care about matching "_dr" and "_DR" (as far as using case insensitive regex) would be to use this syntax:

db.searchSeriesOnChannel1.find({channelId:{$in:["CTVTo", "CTVHT"]}, $or:[{searchField:{$regex:"_dr"}},{searchField:{$regex:"_DR"}}]}).sort({searchField:1})

Comment by David Storch [ 20/Oct/14 ]

I can reproduce the issue with this simple script:

var t = db.t;
t.drop();
t.ensureIndex({"channelId" : 1, "searchField" : 1});
t.insert({_id: 1, searchField: "foo", channelId: "CTVTo"});
 
// When we force a collection scan, this query correctly returns 0 results.
var count = t.find({$and : [{ "searchField" : { "$regex" : "_dr" , "$options" : "i"}, "channelId" : { "$in" : [ "CTVTo" , "CTVHT"]}}]}).sort({searchField:1}).hint({$natural: 1}).itcount();
assert.eq(0, count);
 
// Without forcing a collection scan, a merge sort plan is constructed. This plan incorrectly returns 1
// result, causing the assertion to fail.
count = t.find({$and : [{ "searchField" : { "$regex" : "_dr" , "$options" : "i"}, "channelId" : { "$in" : [ "CTVTo" , "CTVHT"]}}]}).sort({searchField:1}).itcount();
assert.eq(0, count);

Explaining the query against a 2.7.7 server shows that the $regex filter is being discarded:

t.find({$and : [{ "searchField" : { "$regex" : "_dr" , "$options" : "i"}, "channelId" : { "$in" : [ "CTVTo" , "CTVHT"]}}]}).sort({searchField:1}).explain()
{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"parsedQuery" : {
			"$and" : [
				{
					"searchField" : /_dr/i
				},
				{
					"channelId" : {
						"$in" : [
							"CTVHT",
							"CTVTo"
						]
					}
				}
			]
		},
		"winningPlan" : {
			"stage" : "FETCH",
			"inputStage" : {
				"stage" : "SORT_MERGE",
				"sortPattern" : {
					"searchField" : 1
				},
				"inputStages" : [
					{
						"stage" : "IXSCAN",
						"keyPattern" : {
							"channelId" : 1,
							"searchField" : 1
						},
						"isMultiKey" : false,
						"direction" : "forward",
						"indexBounds" : "field #0['channelId']: [\"CTVHT\", \"CTVHT\"], field #1['searchField']: [\"\", {}), [/_dr/i, /_dr/i]"
					},
					{
						"stage" : "IXSCAN",
						"keyPattern" : {
							"channelId" : 1,
							"searchField" : 1
						},
						"isMultiKey" : false,
						"direction" : "forward",
						"indexBounds" : "field #0['channelId']: [\"CTVTo\", \"CTVTo\"], field #1['searchField']: [\"\", {}), [/_dr/i, /_dr/i]"
					}
				]
			}
		},
		"rejectedPlans" : [ ]
	},
	"serverInfo" : {
		"host" : "Macintosh.local",
		"port" : 27017,
		"version" : "2.7.7",
		"gitVersion" : "afae7c082b1b4eff8401f660124e161137fe1d2b",
		"os" : {
			"type" : "Darwin",
			"name" : "Mac OS X",
			"version" : "13.4.0\u0000"
		}
	},
	"ok" : 1
}

The problem is that the stage labeled SORT_MERGE is not filtering the results according to the regular expression.

Comment by Nicolas [ 20/Oct/14 ]

Yeah, but the problem is that we can have a LOT of channelIds. It would make the query REALLY big for nothing

Comment by Asya Kamsky [ 18/Oct/14 ]

Ironically, the performance should be better (the index will be used properly then - I think the bug may in how it decides the index should apply to original query which ought to be identical to re-written query but isn't).

Comment by Simon Lavigne-Giroux [ 18/Oct/14 ]

Are the performance going to be the same?

Comment by Asya Kamsky [ 18/Oct/14 ]

Possible workaround:

Transforming the query:

find({channelId:{$in:["CTVTo", "CTVHT"]}, searchField:{$regex:"_dr"}}).sort({searchField:1})

into

find({$or:[ {channelId:"CTVTo", searchField:{$regex:"_dr"}}, {channelId:"CTVHT", searchField:{$regex:"_dr"}} ]}).sort({searchField:1})

returns the correct results.

Comment by Simon Lavigne-Giroux [ 18/Oct/14 ]

Yes that was my observation as well. They are all necessary to trigger the incorrect behavior.

Comment by Asya Kamsky [ 18/Oct/14 ]

simonlg

Thank you for reporting the problem and for the detailed reproducer. We have been able to confirm this bug - it seems that the $in clause on the leading part of the index and the sort on the second field of the index (and $regex on second field) are all necessary to trigger the incorrect behavior, was that your observation as well?

Asya

Generated at Thu Feb 08 03:38:42 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.