[SERVER-17535] FTS doesn't match full words when doing filtering on the set found from index scan Created: 11/Mar/15  Updated: 28/Dec/23

Status: Backlog
Project: Core Server
Component/s: Text Search
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Henrik Ingo (Inactive) Assignee: Backlog - Query Integration
Resolution: Unresolved Votes: 1
Labels: qi-text-search, query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Query Integration
Operating System: ALL
Steps To Reproduce:

sputnik-rs:PRIMARY> db.ftstest.insert( { data1 : "abcd", data2 : "efgh", data3 : "ijkl" } )
WriteResult({ "nInserted" : 1 })
sputnik-rs:PRIMARY> db.ftstest.createIndex( { data1 : "text", data2 : "text", data3 : "text" } )
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 1,
	"numIndexesAfter" : 2,
	"ok" : 1
}
sputnik-rs:PRIMARY> db.ftstest.find( { $text : { $search : '"abcd" "efgh" "ijkl"' } } )
{ "_id" : ObjectId("54fff32bb43016d00d95734a"), "data1" : "abcd", "data2" : "efgh", "data3" : "ijkl" }
sputnik-rs:PRIMARY> db.ftstest.find( { $text : { $search : '"abc" "efg" "ijk"' } } )
sputnik-rs:PRIMARY> db.ftstest.find( { $text : { $search : '"abcd" "efg" "ijk"' } } )
{ "_id" : ObjectId("54fff32bb43016d00d95734a"), "data1" : "abcd", "data2" : "efgh", "data3" : "ijkl" }
sputnik-rs:PRIMARY> db.ftstest.find( { $text : { $search : '"bcd" "fgh" "jkl"' } } )
sputnik-rs:PRIMARY> db.ftstest.find( { $text : { $search : '"abcd" "fgh" "jkl"' } } )
{ "_id" : ObjectId("54fff32bb43016d00d95734a"), "data1" : "abcd", "data2" : "efgh", "data3" : "ijkl" }
sputnik-rs:PRIMARY>

 Description   

The following behavior with FTS seems inconsistent:

sputnik-rs:PRIMARY> db.ftstest.insert( { data1 : "abcd", data2 : "efgh", data3 : "ijkl" } )
WriteResult({ "nInserted" : 1 })
sputnik-rs:PRIMARY> db.ftstest.createIndex( { data1 : "text", data2 : "text", data3 : "text" } )
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 1,
	"numIndexesAfter" : 2,
	"ok" : 1
}
sputnik-rs:PRIMARY> db.ftstest.find( { $text : { $search : '"abcd" "efgh" "ijkl"' } } )
{ "_id" : ObjectId("54fff32bb43016d00d95734a"), "data1" : "abcd", "data2" : "efgh", "data3" : "ijkl" }
sputnik-rs:PRIMARY> db.ftstest.find( { $text : { $search : '"abc" "efg" "ijk"' } } )
sputnik-rs:PRIMARY> db.ftstest.find( { $text : { $search : '"abcd" "efg" "ijk"' } } )
{ "_id" : ObjectId("54fff32bb43016d00d95734a"), "data1" : "abcd", "data2" : "efgh", "data3" : "ijkl" }
sputnik-rs:PRIMARY> db.ftstest.find( { $text : { $search : '"bcd" "fgh" "jkl"' } } )
sputnik-rs:PRIMARY> db.ftstest.find( { $text : { $search : '"abcd" "fgh" "jkl"' } } )
{ "_id" : ObjectId("54fff32bb43016d00d95734a"), "data1" : "abcd", "data2" : "efgh", "data3" : "ijkl" }
sputnik-rs:PRIMARY>

What happens above:

  1. In the first query, all search words match and the document is returned. This is expected.
  2. In the second query, we removed the last letter of each word. As a result they no longer match the full words and nothing is returned. This is expected.
  3. In the third query, the first word is again the full 4 letter word, but the other two are only parts of words. Since this is an AND search, this should return an empty set, but it returns the document because the second and third search words match parts of words in the document.
  4. In the fourth and fifth queries the same behavior is demonstrated by removing the first letter instead: from all three words in the fourth query, and from only the last two words in the fifth.

It seems to me that when scanning the index, MongoDB matches full words (post stemming). This is expected. However, the documents found by the index scan then go through a filtering step, and that step matches parts of words.
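
To make the hypothesis above concrete, here is a toy model in plain JavaScript. It is only a sketch and not MongoDB's code: the names indexedTerms, parsePhrases and matches are invented for illustration, stemming is ignored, and every phrase is assumed to be a single word. Stage 1 finds the document if any search word equals an indexed word; stage 2 requires every quoted phrase to occur, either as a substring or as a whole word.

```javascript
// Toy model of the hypothesised two-stage evaluation. NOT MongoDB's implementation.
const doc = { data1: "abcd", data2: "efgh", data3: "ijkl" };

// Stage 1 stand-in: the "text index" holds whole words only (stemming ignored).
const indexedTerms = new Set(Object.values(doc));

// Extract the quoted phrases from a $search string such as '"abcd" "efg"'.
function parsePhrases(search) {
  return [...search.matchAll(/"([^"]*)"/g)].map((m) => m[1]);
}

// Stage 2 stand-in: every phrase must occur in the document text.
// wholeWords = false models the suspected substring filter,
// wholeWords = true models the filter the report expects.
function phraseFilter(phrases, wholeWords) {
  const words = Object.values(doc);
  const text = words.join(" ");
  return phrases.every((p) => (wholeWords ? words.includes(p) : text.includes(p)));
}

function matches(search, wholeWords = false) {
  const phrases = parsePhrases(search);
  const foundByIndex = phrases.some((p) => indexedTerms.has(p));
  return foundByIndex && phraseFilter(phrases, wholeWords);
}

// The five queries from the transcript, in order.
const queries = [
  '"abcd" "efgh" "ijkl"',
  '"abc" "efg" "ijk"',
  '"abcd" "efg" "ijk"',
  '"bcd" "fgh" "jkl"',
  '"abcd" "fgh" "jkl"',
];

console.log(queries.map((q) => matches(q)));       // [ true, false, true, false, true ]
console.log(queries.map((q) => matches(q, true))); // [ true, false, false, false, false ]
```

Under these assumptions the substring variant reproduces the five results in the transcript, while the whole-word variant returns the document only for the first query.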

Without looking at the code, I recognize that this is a common error when using regular expression libraries. To match a full word, the syntax ^abcd$ should be used, but a developer may easily forget that and just search for abcd, which will match any string that includes abcd as a part of itself.
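
The difference can be shown with plain JavaScript regular expressions. This only illustrates the general pitfall; whether the server's filtering step uses regular expressions at all is an assumption, and the variable names are made up for the example.

```javascript
// Anchored versus unanchored matching, using the word "efgh" from the test document.
const word = "efgh";

const unanchored = /efg/;  // matches anywhere inside the string
const anchored = /^efg$/;  // matches only if the whole string is "efg"

console.log(unanchored.test(word)); // true: "efg" is a part of "efgh"
console.log(anchored.test(word));   // false: "efgh" is not exactly "efg"

// When searching inside a longer text, word boundaries (\b) play the role of ^ and $.
const text = "abcd efgh ijkl";
console.log(/\befg\b/.test(text));  // false: "efg" is not a whole word in the text
console.log(/\befgh\b/.test(text)); // true: "efgh" is a whole word in the text
```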



 Comments   
Comment by David Storch [ 12/Mar/15 ]

My bad, you're right. Even single-term phrase queries form a logical AND with the rest of the query terms.

Comment by Henrik Ingo (Inactive) [ 12/Mar/15 ]

Same response: Quoted words are required/AND search terms. See https://jira.mongodb.org/browse/SERVER-17533?focusedCommentId=849902&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-849902

Comment by David Storch [ 11/Mar/15 ]

Hi henrik.ingo@10gen.com,

I believe that this is working as designed. Quoting from the documentation for the $text query operator:

$search: A string of terms that MongoDB parses and uses to query the text index. MongoDB performs a logical OR search of the terms unless specified as a phrase. See Behavior for more information on the field.

The queries above specify a search for three individual terms (not a phrase), and therefore will act as a logical OR. This is much like a keyword search in a web search engine, where a document matches if any of the keywords are found in the document. The more keywords match, the higher the relevance score.

I am going to close as Works as Designed, but please re-open if you have any further questions or concerns.

Best,
Dave

Generated at Thu Feb 08 03:44:49 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.