Loading...

XML

Word

Printable

JSON

Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.6.8
Component/s: Text Search
Labels:
- qi-text-search

Assigned Teams:

Query Integration
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Description

An AND query in MongoDB FTS seems to be implemented as fetching all of the words separately, then taking an intersection. This is not efficient when:

A lot of search words are given. (e.g 5-10 instead of 1-2.) In theory this should help narrow down the result set and make the query faster, but in MongoDB this will always make the query slower. (``nscanned`` and ``nscannedObjects`` will grow even if the result set ``n`` becomes smaller.)

One or more very common search words are given.

For example, consider we search a database with people's addresses with the following words::

    db.somecollection.find( { $text : { $search : '"John" "Ezequiel" "Smith" "New" "York"' } } )

In the above example, one of the words - "Ezequiel" - is clearly an uncommon name, which will occur with low frequency in the database. On the other hand the names "John Smith" and "New York" are clearly very common ones, especially the number of entries for "New York" might very well be in the millions.

What should happen

An efficient query plan for MongoDB would be to just search the FTS index for "Ezequiel", then filter the result set of that to exclude documents that do not include all of the other search words.

Is it possible to deduce from our btree indexes how many occurrences there are of the (after stemming) identical value? If yes, the query planner should consider this and only scan the index for values where this number is low (and therefore selectiveness high).

If such statistics is not possible, one possible way to implement this is analogous to how the query planner selects execution plans for regular queries. Execute one thread per search word in parallel, each thread scanning the index for the given word. When the fastest thread finishes scanning the index, start fetching the documents pointed to, and filter out documents that don't match all of the given AND search words. When this is finished, return this result set and kill the other threads that are still scanning the index.

In the above example, query exectuion would start by launching a separate thread for each word to be scanned from the index. The thread scanning for "Ezequiel" would complete first and would then proceed to fetching the documents containing "Ezequiel" and apply filtering on those documents. Once this completes, the threads scanning for "John", "Smith", "New" and "York" would simply be terminated and their results discarded.

Workaround

Note that with the current version of MongoDB, the latter method above can also be implemented on the application level, as a workaround to optimize slow FTS queries.

Assignee:: [DO NOT USE] Backlog - Query Integration
Reporter:: Henrik Ingo (Inactive)
Participants:: [DO NOT USE] Backlog - Query Integration, Henrik Ingo
Votes:: 2 Vote for this issue
Watchers:: 10 Start watching this issue

Created:: Mar 11 2015 07:38:34 AM UTC
Updated:: Dec 28 2023 06:34:39 PM UTC

Details

Description

Attachments

Forms

Activity

People

Dates