-
Type: Improvement
-
Resolution: Unresolved
-
Priority: Major - P3
-
None
-
Affects Version/s: 2.6.8
-
Component/s: Text Search
-
Query Integration
Description
An AND query in MongoDB FTS seems to be implemented as fetching all of the words separately, then taking an intersection. This is not efficient when:
- A lot of search words are given. (e.g 5-10 instead of 1-2.) In theory this should help narrow down the result set and make the query faster, but in MongoDB this will always make the query slower. (``nscanned`` and ``nscannedObjects`` will grow even if the result set ``n`` becomes smaller.)
- One or more very common search words are given.
For example, consider we search a database with people's addresses with the following words::
db.somecollection.find( { $text : { $search : '"John" "Ezequiel" "Smith" "New" "York"' } } )
In the above example, one of the words - "Ezequiel" - is clearly an uncommon name, which will occur with low frequency in the database. On the other hand the names "John Smith" and "New York" are clearly very common ones, especially the number of entries for "New York" might very well be in the millions.
What should happen
An efficient query plan for MongoDB would be to just search the FTS index for "Ezequiel", then filter the result set of that to exclude documents that do not include all of the other search words.
Is it possible to deduce from our btree indexes how many occurrences there are of the (after stemming) identical value? If yes, the query planner should consider this and only scan the index for values where this number is low (and therefore selectiveness high).
If such statistics is not possible, one possible way to implement this is analogous to how the query planner selects execution plans for regular queries. Execute one thread per search word in parallel, each thread scanning the index for the given word. When the fastest thread finishes scanning the index, start fetching the documents pointed to, and filter out documents that don't match all of the given AND search words. When this is finished, return this result set and kill the other threads that are still scanning the index.
In the above example, query exectuion would start by launching a separate thread for each word to be scanned from the index. The thread scanning for "Ezequiel" would complete first and would then proceed to fetching the documents containing "Ezequiel" and apply filtering on those documents. Once this completes, the threads scanning for "John", "Smith", "New" and "York" would simply be terminated and their results discarded.
Workaround
Note that with the current version of MongoDB, the latter method above can also be implemented on the application level, as a workaround to optimize slow FTS queries.