[SERVER-8341] Stemming and stop word deletion in phrases Created: 25/Jan/13  Updated: 19/Mar/13  Resolved: 01/Feb/13

Status: Closed
Project: Core Server
Component/s: Text Search
Affects Version/s: 2.3.2
Fix Version/s: None

Type: Bug
Priority: Critical - P2
Reporter: Matt Bates
Assignee: Unassigned
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Mac OS 10.7.5.
MongoDB 2.3.2


Operating System: ALL
Participants:

 Description   

Having indexed the Enron email dataset with a two-field text index using the default weights, I'm seeing unexpected behaviour when searching for phrases. It appears that, within a phrase, terms are being stemmed and stop words removed. In the queryDebugString output below, "the" is dropped and "scrimmage" becomes "scrimmag":

> db.getCollection("emails").runCommand("text", { "search" : "\"the scrimmage\"", limit : 1 });

"queryDebugString" : "scrimmag||||the scrimmage||"

> db.getCollection("emails").runCommand("text", { "search" : '"the scrimmage"', limit : 1 });

"queryDebugString" : "scrimmag||||the scrimmage||"

This behaviour was first spotted by a MongoDB user at the FTS Hackathon in London.



 Comments   
Comment by J Rassi [ 29/Jan/13 ]

> How would it be possible to search exactly for a phrase - without stop word removal and stemming?

The phrase search you pasted above will do exactly this.

Search queries are reduced to a list of stemmed terms, which are used to query into the index. This list includes words from phrases and also words not inside phrases. Stopwords are not included. For your example, this list will be the singleton (scrimmag). After the results come back from the index, the matcher is invoked to filter out documents that do not include the specified phrases.
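
As a rough illustration of the two stages described above, the following standalone JavaScript sketch mimics the flow: reduce the query to stemmed, stop-word-free terms for the index lookup, then filter the candidates by the literal phrase. It is not the server's code. STOP_WORDS, toyStem, parseQuery and runTextSearch are hypothetical names invented for this example, and the toy stemmer only stands in for the real one.

```javascript
// Illustrative sketch only; NOT MongoDB's implementation.
// STOP_WORDS and toyStem are hypothetical stand-ins for the real
// stop word list and stemmer used by the text index.
const STOP_WORDS = new Set(["the", "a", "an", "of", "and"]);

// Toy stemmer: strips a trailing "es", "e" or "s",
// so "scrimmage" and "scrimmages" both become "scrimmag".
function toyStem(word) {
  return word.replace(/(es|e|s)$/, "");
}

// Lowercase the text and split it into alphanumeric tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Stem every token that is not a stop word, without duplicates.
function stemmedTerms(text) {
  const stems = tokenize(text)
    .filter((w) => !STOP_WORDS.has(w))
    .map(toyStem);
  return [...new Set(stems)];
}

// Reduce a search string to (a) stemmed terms for the index lookup
// and (b) the literal quoted phrases for the later matcher step.
function parseQuery(searchString) {
  const phrases = [...searchString.matchAll(/"([^"]+)"/g)].map((m) =>
    m[1].toLowerCase()
  );
  return { terms: stemmedTerms(searchString), phrases };
}

function runTextSearch(docs, searchString) {
  const { terms, phrases } = parseQuery(searchString);

  // Stage 1: "index lookup" on stemmed terms (a linear scan here).
  const candidates = docs.filter((doc) => {
    const docTerms = new Set(stemmedTerms(doc));
    return terms.some((t) => docTerms.has(t));
  });

  // Stage 2: the "matcher" keeps only documents that contain every
  // quoted phrase literally, stop words included.
  return candidates.filter((doc) =>
    phrases.every((p) => doc.toLowerCase().includes(p))
  );
}

const docs = [
  "We lost the scrimmage on Friday.",
  "Two scrimmages were scheduled.",
  "A scrimmage is planned.",
];

console.log(parseQuery('"the scrimmage"'));
console.log(runTextSearch(docs, '"the scrimmage"'));
```

In this sketch all three sample documents survive the term lookup, because each contains a word that stems to "scrimmag", but only the first survives the phrase filter. That mirrors why a debug string showing only the stemmed term does not by itself mean the phrase is being ignored.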

I'll mark this ticket for review by the docs team.

Comment by Matt Bates [ 28/Jan/13 ]

I read the release notes for 2.4 (2.3.2), and they state that phrases and negations are not stemmed. Maybe the documentation is wrong and needs correcting?

How would it be possible to search exactly for a phrase - without stop word removal and stemming? That's what was desired in the queries above ('the scrimmage') and was asked/expected at the hackathon.

Comment by Eliot Horowitz (Inactive) [ 25/Jan/13 ]

That is as designed.
Are you seeing results that are incorrect?
