[SERVER-17144] Use text search suffix field predicates to limit index scan Created: 02/Feb/15 Updated: 06/Dec/22 Resolved: 18/Jan/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying, Text Search |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Akira Kurogane | Assignee: | Backlog - Query Team (Inactive) |
| Resolution: | Won't Fix | Votes: | 2 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Assigned Teams: |
Query
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Participants: | |||||||||
| Description |
|
To set the terminology: In the following index description
the text index prefix is {a: 1, b: 1} and the text index suffix is {d: 1, e: 1}. Predicates over the fields in the text index suffix are not used to constrain the number of documents index-scanned during a full-text search. For example consider a collection where the search predicate
would find 30 documents, and
would find only 10 documents. During the second query the query optimizer does not use the 'd' predicate to avoid scanning the other 20 documents during the query execution. The "nscanned" value will be 30 for both, even though "n" is different. |
| Comments |
| Comment by David Storch [ 02/Feb/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Here are some more details regarding this issue. Let's break down what happens when you run the following:
When the insert occurs, the FTS key generation code builds the index keys that will be added to the text index. The text index key format is (prefix, term, score, suffix), with one key generated per tokenized and stemmed term. In this case, the keys will be something like:
The field names are stripped out of index keys, but note that the first field is a, the second is the stemmed term, the third is the term's score, and the fourth is c. Now let's consider how the query system answers the following query:
This query goes through special $text planning in order to ensure that there is a text index available and that there are equality predicates over each field in the text index prefix. The planner separates these equality predicates over the prefix from any other predicates that can use the text index. The new explain for 3.0 shows that this is how the query is planned:
Note that the winning plan is a TEXT stage where {a: 0} is the index prefix, and {c: {$gte: 0}} is a separate filter. The reason for this should become clear when I describe how the query execution framework executes the TEXT stage. The first thing that happens in TEXT execution is that a separate index scan is produced for each of the search terms. In this case there will be one index scan for "two" and another for "words". Each index scan is set up to scan a contiguous range of the text index, so the two scans will have bounds that look something like this:
Now hopefully you can see why we need equality predicates over the prefix fields: this ensures that we can construct a single-range index scan. By the same token, the bounds for the filter {c: {$gte: 0}} are not incorporated here because this would cause the index scan to involve multiple ranges. This limitation of the implementation could probably be relaxed to incorporate the trailing predicates into the bounds for each scan. The first scan will find the key generated for the term "two" and the second scan will find the key generated for the term "word". Both of these keys point back to the same document in the collection. The scores for these two keys will be aggregated in order to compute the text score, after being passed through the filter on c to ensure that the key matches the query. Finally, the document is returned along with text score metadata. |