[SERVER-17144] Use text search suffix field predicates to limit index scan Created: 02/Feb/15  Updated: 06/Dec/22  Resolved: 18/Jan/18

Status: Closed
Project: Core Server
Component/s: Querying, Text Search
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Akira Kurogane Assignee: Backlog - Query Team (Inactive)
Resolution: Won't Fix Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
Assigned Teams:
Query
Backwards Compatibility: Fully Compatible

 Description

To set the terminology:

In the following index description

{a: 1, b: 1, c: "text", d: 1, e: 1}

the text index prefix is {a: 1, b: 1} and the text index suffix is {d: 1, e: 1}.

Predicates over the fields in the text index suffix are not used to constrain the number of documents index-scanned during a full-text search.

For example consider a collection where the search predicate

{a: 123, b: 456, $text: {$search: 'foo bar'}} 

would find 30 documents, and

{a: 123, b: 456, $text: {$search: 'foo bar'}, d: {$gte: 0}} 

would find only 10 documents.

For the second query, the query optimizer does not use the 'd' predicate to avoid scanning the 20 documents that fail it: "nscanned" will be 30 for both queries, even though "n" differs.
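
As a rough illustration (placeholder collection and documents; on 3.0+ the legacy "nscanned" counter is reported as executionStats.totalKeysExamined):

db.coll.drop();
db.coll.createIndex({a: 1, b: 1, c: "text", d: 1, e: 1});
db.coll.insert({a: 123, b: 456, c: "foo bar", d: 5});
db.coll.insert({a: 123, b: 456, c: "foo bar", d: -5});
var explained = db.coll.find(
    {a: 123, b: 456, $text: {$search: "foo bar"}, d: {$gte: 0}}
).explain("executionStats");
// All index keys in the term ranges are examined, but only one document is returned.
printjson({nReturned: explained.executionStats.nReturned,
           totalKeysExamined: explained.executionStats.totalKeysExamined});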



 Comments
Comment by David Storch [ 02/Feb/15 ]

Here are some more details regarding this issue. Let's break down what happens when you run the following:

t.drop();
t.ensureIndex({a: 1, b: "text", c: 1});
t.insert({a: 0, b: "two words", c: 6});
t.find({a: 0, $text: {$search: "two words"}, c: {$gte: 0}});

When the insert occurs, the FTS key generation code builds the index keys that will be added to the text index. The text index key format is (prefix, term, score, suffix), with one key generated per tokenized and stemmed term. In this case, the keys will be something like:

{'': 0, '': 'two', '': 0.75, '': 6}
{'': 0, '': 'word', '': 0.75, '': 6}

The field names are stripped out of index keys, but note that the first field is a, the second is the stemmed term, the third is the term's score, and the fourth is c.
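
To make that key layout concrete, here is a rough sketch in shell JavaScript; tokenizeAndStem, scoreFor, and buildTextIndexKeys are invented stand-ins for illustration, not the server's FTS code:

// Crude, self-contained stand-ins for the server's tokenizer/stemmer and scoring.
function tokenizeAndStem(text) {
    return text.toLowerCase().split(/\s+/).map(function (w) {
        return w.replace(/s$/, "");   // toy stemming: "words" -> "word"
    });
}
function scoreFor(term) {
    return 0.75;                      // placeholder; the real score is frequency-based
}
// One index key per stemmed term: prefix value(s), term, score, suffix value(s).
// The real keys are BSON documents with empty field names, as shown above.
function buildTextIndexKeys(doc) {
    return tokenizeAndStem(doc.b).map(function (term) {
        return [doc.a, term, scoreFor(term), doc.c];
    });
}
buildTextIndexKeys({a: 0, b: "two words", c: 6});
// -> [ [ 0, "two", 0.75, 6 ], [ 0, "word", 0.75, 6 ] ]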

Now let's consider how the query system answers the following query:

t.find({a: 0, $text: {$search: "two words"}, c: {$gte: 0}});

This query goes through special $text planning in order to ensure that there is a text index available and that there are equality predicates over each field in the text index prefix. The planner separates these equality predicates over the prefix from any other predicates that can use the text index. The new explain for 3.0 shows that this is how the query is planned:

{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "test.t",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"$and" : [
				{
					"a" : {
						"$eq" : 0
					}
				},
				{
					"c" : {
						"$gte" : 0
					}
				},
				{
					"$text" : {
						"$search" : "two words",
						"$language" : ""
					}
				}
			]
		},
		"winningPlan" : {
			"stage" : "TEXT",
			"filter" : {
				"c" : {
					"$gte" : 0
				}
			},
			"indexPrefix" : {
				"a" : 0
			},
			"indexName" : "a_1_b_text_c_1",
			"parsedTextQuery" : {
 
			}
		},
		"rejectedPlans" : [ ]
	},
	"serverInfo" : {
		"host" : "dstorch-desktop",
		"port" : 27017,
		"version" : "3.1.0-pre-",
		"gitVersion" : "94cb08d1d1a14733ebe875f941f4ec1eb8a44b91"
	},
	"ok" : 1
}

Note that the winning plan is a TEXT stage where {a: 0} is the index prefix, and {c: {$gte: 0}} is a separate filter. The reason for this should become clear when I describe how the query execution framework executes the TEXT stage.

The first thing that happens in TEXT execution is that a separate index scan is produced for each of the search terms. In this case there will be one index scan for "two" and another for "words". Each index scan is set up to scan a contiguous range of the text index, so the two scans will have bounds that look something like this:

{a: [[0, 0]], _fts: [["two", "two"]], _ftsx: [[MinKey, MaxKey]], c: [[MinKey, MaxKey]]}
{a: [[0, 0]], _fts: [["word", "word"]], _ftsx: [[MinKey, MaxKey]], c: [[MinKey, MaxKey]]}

Now hopefully you can see why we need equality predicates over the prefix fields: this ensures that we can construct a single-range index scan. By the same token, the bounds for the filter {c: {$gte: 0}} are not incorporated here because this would cause the index scan to involve multiple ranges. This limitation of the implementation could probably be relaxed to incorporate the trailing predicates into the bounds for each scan.
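
As a quick side demonstration of that requirement (my own example against the same collection; the exact error text varies by version):

// An equality over the prefix field "a" lets the planner build one contiguous
// index range per term:
t.find({a: 0, $text: {$search: "two words"}}).itcount();   // plans fine, returns 1
// A range predicate over the prefix cannot be collapsed into a single contiguous
// range, so the planner cannot satisfy the $text query with this compound index
// and the query errors:
t.find({a: {$gte: 0}, $text: {$search: "two words"}}).itcount();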

The first scan will find the key generated for the term "two" and the second scan will find the key generated for the term "word". Both of these keys point back to the same document in the collection. Each key is passed through the filter on c to ensure it matches the query, and the scores of the matching keys are aggregated to compute the text score. Finally, the document is returned along with its text score metadata.
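
For reference, that text score metadata is available through the standard $meta projection and sort:

t.find({a: 0, $text: {$search: "two words"}, c: {$gte: 0}},
       {score: {$meta: "textScore"}}
).sort({score: {$meta: "textScore"}});
// Each returned document carries the score aggregated by the TEXT stage.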
