[SERVER-30726] inconsistent treatment of stopwords with language="none" Created: 17/Aug/17  Updated: 27/Oct/23  Resolved: 01/Dec/17

Status: Closed
Project: Core Server
Component/s: Text Search
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Rajhans Samdani Assignee: Kyle Suarez
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-29918 stemming behavior for diacritics caus... Closed
Operating System: ALL
Sprint: Query 2017-10-23, Query 2017-11-13, Query 2017-12-04
Participants:

 Description   

I am using MongoDB 3.2.8 on macOS. Here is a bug I'm facing when using language="none". I expected that setting language="none" would make MongoDB stop ignoring stopwords.

I have created a collection 'resources' with a text index on a field called 'title', using the default language (English).

First, I create a document with title = 'What are you?'.

  1. When I search with db.getCollection('resources').find({$text:{$search: 'what are you?'}}) I get no results as expected because all the words are stopwords.
  2. When I search with
    db.getCollection('resources').find({$text:{$search: 'what are you?', $language: "none"}}), I still get no results. Ideally, I'd like this to return the document.

Now I modified the document to have title = 'Whats are you?' (notice the typo 'Whats')

  1. When I search with
    db.getCollection('resources').find({$text:{$search: 'whats are you?', $language: "none"}}), I do not get this document even though there is an exact match and 'whats' is presumably not a stopword!
  2. When I search with
    db.getCollection('resources').find({$text:{$search: 'what are you?', $language: "none"}}), I get this document in return!
  3. Language="en" is working fine: when I search with db.getCollection('resources').find({$text:{$search: 'whats are you?'}}) I get the result and when searching with db.getCollection('resources').find({$text:{$search: 'what are you?'}}), I don't.


 Comments   
Comment by Rajhans Samdani [ 23/Oct/17 ]

Makes sense. Thanks!
Feel free to close this issue.

Comment by Kelsey Schubert [ 20/Sep/17 ]

Hi rajhans,

Sorry for the delay getting back to you. The issue you're observing is the result of how text indexes are currently stored and queried against.

When text indexes are created, mongod will stem each word that isn't a stop word according to the language rules. These stemmed words are the tokens used as the index keys for all subsequent queries and point to their corresponding documents.

When MongoDB queries a text index, it first removes the stop words and then stems the remaining words in the phrase according to the language rules that were passed in (or falls back to the default of English).

If the stop word/stemming rules of the text index differ from the stop word/stemming rules of the query (e.g. the query uses a language other than the one the index was built with), it is possible that MongoDB will not find a match, since the roots from the stemmed query do not match any tokens in the index.

In your first example, no token "what" is inserted into the index when the document is inserted, because "what" is an English stop word. As a consequence, no subsequent query with the root "what" will find the document containing "what".

In your second example, "whats" is stemmed to "what" according to English stemming rules, and mongod updates its text index to include the token "what" pointing to the newly inserted document. When the query uses the language rules of none, mongod does not stem "whats"; it queries on "whats" and cannot find any matching tokens or documents. However, if the query is stemmed to "what", then the document is returned as expected.
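
The two examples can be reproduced with a toy model of the index. This is only a sketch: the stop word list, the "strip a trailing s" stemmer, and the function names (tokenize, buildIndex, search) are stand-ins chosen for illustration, not MongoDB's actual rules or internals. It shows that index keys are fixed by the index language, while query terms are produced by the query language.

```javascript
// Toy model only: the stop word list and the stemmer below are illustrative
// stand-ins, not MongoDB's real English rules.
const RULES = {
  english: {
    stopWords: new Set(["what", "are", "you"]),
    stem: (word) =>
      word.length > 3 && word.endsWith("s") ? word.slice(0, -1) : word,
  },
  none: {
    stopWords: new Set(), // no stop words
    stem: (word) => word, // no stemming
  },
};

// Lowercase, split on non-letters, drop stop words, stem what is left.
function tokenize(text, language) {
  const { stopWords, stem } = RULES[language];
  return text
    .toLowerCase()
    .split(/[^a-z]+/)
    .filter((word) => word.length > 0 && !stopWords.has(word))
    .map(stem);
}

// Index keys are computed once, at insert time, with the index language.
function buildIndex(titles, indexLanguage) {
  const index = new Map();
  titles.forEach((title, docId) => {
    for (const token of tokenize(title, indexLanguage)) {
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(docId);
    }
  });
  return index;
}

// A document matches if any query token is a key in the index.
function search(index, query, queryLanguage) {
  const hits = new Set();
  for (const token of tokenize(query, queryLanguage)) {
    for (const docId of index.get(token) || []) hits.add(docId);
  }
  return [...hits];
}

// First example: every word is an English stop word, so the index is empty.
const first = buildIndex(["What are you?"], "english");
console.log(search(first, "what are you?", "none")); // []

// Second example: "whats" is indexed under the key "what".
const second = buildIndex(["Whats are you?"], "english");
console.log(search(second, "whats are you?", "none")); // []
console.log(search(second, "what are you?", "none")); // [ 0 ]
console.log(search(second, "whats are you?", "english")); // [ 0 ]
```

In this model the query language never changes what is stored in the index, which is why a language="none" query can only match keys that English indexing already produced.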

We're actively discussing ways to improve this behavior. Please feel free to review a related ticket, SERVER-29918, which discusses a similar stemming issue. For now, I would recommend using the same language for both your queries and text index.
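
One way to follow this recommendation for the original goal (matching stop words) might look like the sketch below, which builds the text index with language "none" and queries with the same language. The helper name findTitleWithoutStopWords is hypothetical, db is assumed to be the mongo shell database handle, and the collection and field names are taken from the report above.

```javascript
// Sketch under stated assumptions: `db` is a mongo shell database handle and
// the collection already holds documents with a `title` field.
function findTitleWithoutStopWords(db, phrase) {
  const resources = db.getCollection("resources");

  // Build the text index with language "none" so stop words are kept and
  // nothing is stemmed at index time. A collection can have only one text
  // index, so an existing English text index must be dropped first.
  resources.createIndex({ title: "text" }, { default_language: "none" });

  // Query with the same language so query terms are tokenized the same way
  // as the index keys.
  return resources.find({ $text: { $search: phrase, $language: "none" } });
}
```

In the shell this would be called as findTitleWithoutStopWords(db, 'what are you?'). With language "none" on both sides there is no stemming at all, so 'what' no longer matches a title containing 'whats'.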

Kind regards,
Kelsey

Generated at Thu Feb 08 04:24:48 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.