[SERVER-8341] Stemming and stop word deletion in phrases Created: 25/Jan/13 Updated: 19/Mar/13 Resolved: 01/Feb/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Text Search |
| Affects Version/s: | 2.3.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Matt Bates | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Mac OS 10.7.5. |
||
| Operating System: | ALL |
| Participants: |
| Description |
|
Having indexed the Enron email dataset with a two-field text index with default weightings, I'm seeing unexpected behaviour when searching for phrases. It appears terms are being stemmed and stop words removed within a phrase:
> db.getCollection("emails").runCommand("text", { "search" : "\"the scrimmage\"", limit:1 });
"queryDebugString" : "scrimmag||||the scrimmage||"
> db.getCollection("emails").runCommand("text", { "search" : '"the scrimmage"', limit:1 });
"queryDebugString" : "scrimmag||||the scrimmage||"
This behaviour was first spotted by a MongoDB user at the FTS Hackathon in London. |
| Comments |
| Comment by J Rassi [ 29/Jan/13 ] |
|
> How would it be possible to search exactly for a phrase - without stop word removal and stemming?
The phrase search you pasted above will do exactly this.
Search queries are reduced to a list of stemmed terms, which are used to query into the index. This list includes words from phrases and also words not inside phrases. Stopwords are not included. For your example, this list will be the singleton (scrimmag).
After the results come back from the index, the matcher is invoked to filter out documents that do not include the specified phrases.
I'll mark this ticket for review by the docs team. |
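The two-stage behaviour described in the comment above can be sketched in plain JavaScript. This is a toy illustration, not the server's implementation: `STOP_WORDS`, `toyStem`, `parseQuery` and `textSearch` are hypothetical names, the stop word list is a tiny assumed sample, the stemmer is a crude suffix stripper standing in for a real stemming algorithm, and the "index lookup" is a linear scan over an in-memory array.

```javascript
// Toy model of the behaviour described above. NOT MongoDB's code.

// Assumed, deliberately tiny stop word list.
const STOP_WORDS = new Set(["the", "a", "an", "of", "and"]);

// Hypothetical stand-in for a real stemmer: strips a trailing "es", "e" or "s".
function toyStem(word) {
  return word.replace(/(es|e|s)$/, "");
}

// Split a string into lowercase alphanumeric words.
function words(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Reduce a search string to (a) stemmed, stop-word-free terms taken from
// every word, inside or outside quotes, and (b) the literal quoted phrases.
function parseQuery(search) {
  const phrases = [...search.matchAll(/"([^"]+)"/g)].map((m) => m[1].toLowerCase());
  const stemmed = words(search).filter((w) => !STOP_WORDS.has(w)).map(toyStem);
  return { terms: [...new Set(stemmed)], phrases };
}

function textSearch(docs, search) {
  const { terms, phrases } = parseQuery(search);
  // Stage 1: candidate documents that share at least one stemmed term
  // (a linear scan here; a real system would consult the text index).
  const candidates = docs.filter((doc) => {
    const stems = new Set(words(doc).map(toyStem));
    return terms.some((t) => stems.has(t));
  });
  // Stage 2: the "matcher" keeps only documents that contain every
  // quoted phrase literally, stop words and unstemmed forms included.
  return candidates.filter((doc) =>
    phrases.every((p) => doc.toLowerCase().includes(p))
  );
}

const docs = [
  "We watched the scrimmage on Friday",
  "Two scrimmages were held",
  "A scrimmage was cancelled",
];

console.log(parseQuery('"the scrimmage"'));
console.log(textSearch(docs, '"the scrimmage"'));
```

In this sketch all three sample documents survive stage 1, because each contains a word that stems to `scrimmag`, but only the first contains the literal phrase and survives stage 2. That mirrors why the debug string in the description shows the stemmed term next to the unstemmed phrase.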
| Comment by Matt Bates [ 28/Jan/13 ] |
|
I read the release notes for 2.4 (2.3.2), and they state that text search does not stem phrases or negations. Maybe the documentation is wrong and needs correcting? How would it be possible to search exactly for a phrase - without stop word removal and stemming? That's what was desired in the queries above ('the scrimmage') and was asked/expected at the hackathon. |
| Comment by Eliot Horowitz (Inactive) [ 25/Jan/13 ] |
|
That is as designed. |