[SERVER-12690] French stop-word list contains non-stop words Created: 12/Feb/14  Updated: 28/Dec/23

Status: Backlog
Project: Core Server
Component/s: Text Search
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Arthur Darcet Assignee: Backlog - Query Integration
Resolution: Unresolved Votes: 1
Labels: pull-request, qi-text-search, query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-10062 Add user configurable stop word lists... Backlog
Related
is related to SERVER-8393 Review French stop word list Closed
Assigned Teams:
Query Integration
Operating System: ALL
Participants:
Case:

 Description   

The current version of the French stop words
https://github.com/mongodb/mongo/blob/master/src/mongo/db/fts/stop_words_french.txt
contains nouns/adjectives/verbs which shouldn't be considered stop words:

aucun (none)
bon (good)
dedans (inside)
dehors (outside)
dos (human back)
droite (right)
début (beginning)
haut (high)
maintenant (now)
moins (less)
mot (word)
nom (name)
nommé, nommée, nommés (named)
nouveau, nouveaux (new)
parole (speech)
personne, personnes (person, people)
sujet (subject)
valeur (value)
voient, vois, voit (see)
vont (go)
état (state)



 Comments   
Comment by Arthur Darcet [ 18/Feb/14 ]

Ok, user-defined stopwords would be great indeed.

If there currently aren't any guidelines or standards for the stopwords, maybe the NLTK stopwords package should be considered? https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/stopwords.zip
I'm not sure what the license is though, so I don't know if the lists could be included as is.

Comment by Daniel Pasette (Inactive) [ 18/Feb/14 ]

Arthur, thanks for the pull request, but we aren't able to change the stop words list at this time, even though I agree your changes are valid (there is unfortunately no single standard for stopwords). What is really needed is support for text indexes that are versioned by stopword list, along with user-defined stopword lists. See SERVER-10062. We need to put this pull request on hold until that issue is resolved.

The problem is that if a word is removed from the stop list, documents subsequently added to the collection will have entries in the index for that word, which will lead to inconsistent search results. Conversely, if a word is added to the stoplist and a document containing that term is deleted, it will leave orphan entries in the index.
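
Both failure modes can be sketched with a toy inverted index. This is only an illustration under assumed behavior, not the server's actual text index code: the `ToyTextIndex` class, its methods, and the sample documents are all hypothetical, and the sketch assumes that deleting a document recomputes its terms with whatever stop list is current.

```python
class ToyTextIndex:
    """Hypothetical toy inverted index; not MongoDB's real implementation."""

    def __init__(self, stop_words):
        self.stop_words = set(stop_words)
        self.postings = {}  # term -> set of document ids

    def _terms(self, text):
        # Tokenize on whitespace and drop words in the *current* stop list.
        return {w for w in text.lower().split() if w not in self.stop_words}

    def add(self, doc_id, text):
        for term in self._terms(text):
            self.postings.setdefault(term, set()).add(doc_id)

    def remove(self, doc_id, text):
        # Assumption: deletion re-derives the terms with the current stop list.
        for term in self._terms(text):
            self.postings.get(term, set()).discard(doc_id)


def demo_word_removed_from_stop_list():
    index = ToyTextIndex(stop_words={"le", "nom"})
    index.add(1, "le nom du client")      # "nom" is skipped at index time
    index.stop_words.discard("nom")       # the stop list changes
    index.add(2, "le nom du produit")     # "nom" is now indexed
    # Both documents contain "nom", but only document 2 has an entry.
    return sorted(index.postings.get("nom", set()))


def demo_word_added_to_stop_list():
    index = ToyTextIndex(stop_words={"le"})
    index.add(1, "le nom du client")      # "nom" is indexed
    index.stop_words.add("nom")           # the stop list changes
    index.remove(1, "le nom du client")   # "nom" is skipped during removal
    # The entry for the deleted document 1 is left behind as an orphan.
    return sorted(index.postings.get("nom", set()))


if __name__ == "__main__":
    print(demo_word_removed_from_stop_list())
    print(demo_word_added_to_stop_list())
```

In this sketch the index stays consistent only if the stop list used at insert time is also the one used at delete and query time, which is what versioning text indexes by stop list would ensure.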

Comment by Arthur Darcet [ 12/Feb/14 ]

I created a pull request removing those terms from the list: https://github.com/mongodb/mongo/pull/633

Generated at Thu Feb 08 03:29:16 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.