[SERVER-30726] inconsistent treatment of stopwords with language="none" Created: 17/Aug/17 Updated: 27/Oct/23 Resolved: 01/Dec/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Text Search |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Rajhans Samdani | Assignee: | Kyle Suarez |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Operating System: | ALL |
| Sprint: | Query 2017-10-23, Query 2017-11-13, Query 2017-12-04 |
| Participants: |
| Description |
|
I am using MongoDB 3.2.8 on macOS. Here is a bug I'm facing when using language="none". I expected that setting language="none" would make MongoDB stop ignoring stop words. I have created a collection 'resources' with a text index on a field called 'title', using the default language (English). First, I create a document with title = 'What are you?'.
Now I modify the document to have title = 'Whats are you?' (notice the typo 'Whats').
|
| Comments |
| Comment by Rajhans Samdani [ 23/Oct/17 ] |
|
Makes sense. Thanks! |
| Comment by Kelsey Schubert [ 20/Sep/17 ] |
|
Hi rajhans,

Sorry for the delay getting back to you. The issue you're observing is the result of how text indexes are currently stored and queried.

When a text index is created, mongod stems each word that isn't a stop word according to the language rules. These stemmed words are the tokens used as the index keys for all subsequent queries, and they point to their corresponding documents. When MongoDB queries a text index, it first removes the stop words and then stems the remaining words in the phrase according to the language rules that were passed in (or falls back to the default of english).

If the stop word/stemming rules of the text index differ from the stop word/stemming rules of the query (e.g. the query uses a language other than the one the index was built with), it is possible that MongoDB will not find a match, since the roots from the stemmed query do not match any tokens in the index.

In your first example, no token "what" is inserted into the index when the document is inserted. As a consequence, no subsequent query with the root "what" will find the document containing "what".

In your second example, "whats" is stemmed to "what" according to english stemming rules, and mongod updates its text index to include the token "what" pointing to the newly inserted document. When mongod does not stem "whats" (according to the language rules of none), it queries on "whats" and cannot find any matching tokens/documents. However, if the query is stemmed to "what", then the document is returned as expected.

We're actively discussing ways to improve this behavior. Please feel free to review a related ticket,

Kind regards, |
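
The mismatch described in the comment above can be illustrated with a small toy model. This is only a sketch and not MongoDB's implementation: the stop-word set is a tiny illustrative subset, the `stem` function is a hypothetical stand-in that strips a single trailing "s" rather than applying real English stemming rules, and the names `tokens` and `matches` are made up for this example. It assumes, as the comment states, that stop words are removed first and the remaining words are then stemmed, both at index time and at query time.

```python
import re

# Toy stand-ins, NOT MongoDB's real rules: a tiny stop-word subset and a
# "stemmer" that only strips one trailing "s" for english.
STOP_WORDS = {"english": {"what", "are", "you"}, "none": set()}


def stem(word, language):
    """Hypothetical stemmer: english drops a trailing "s", none does nothing."""
    if language == "english" and len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def tokens(text, language):
    """Lowercase, split into words, drop stop words, then stem the rest."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {stem(w, language) for w in words if w not in STOP_WORDS[language]}


def matches(index_tokens, query, query_language):
    """A query matches if any of its tokens is also an index key."""
    return bool(tokens(query, query_language) & index_tokens)


# Both documents are indexed with the index's language (english).
first = tokens("What are you?", "english")    # every word is a stop word
second = tokens("Whats are you?", "english")  # "whats" is stemmed to "what"

print(first)                                  # set()
print(second)                                 # {'what'}
print(matches(first, "what", "none"))         # False: index holds no "what"
print(matches(second, "whats", "none"))       # False: "whats" is not a key
print(matches(second, "whats", "english"))    # True: query stems to "what"
```

In this model the index keys are always produced with the index's language, so a query only matches when its own stop-word and stemming rules yield one of those keys.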