[SERVER-12690] French stop-word list contains non-stop words Created: 12/Feb/14 Updated: 28/Dec/23 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Text Search |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor - P4 |
| Reporter: | Arthur Darcet | Assignee: | Backlog - Query Integration |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | pull-request, qi-text-search, query-44-grooming |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Assigned Teams: | Query Integration |
| Operating System: | ALL |
| Case: | (copied to CRM) |
| Description |
|
The current version of the French stop words aucun (none) |
| Comments |
| Comment by Arthur Darcet [ 18/Feb/14 ] |
|
OK, user-defined stopwords would be great indeed. If there currently aren't any guidelines or standards for the stopwords, maybe the NLTK stopwords package should be considered? https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/stopwords.zip |
| Comment by Daniel Pasette (Inactive) [ 18/Feb/14 ] |
|
Arthur, thanks for the pull request, but we aren't able to change the stop-word list at this time, even though I agree your changes are valid (there is unfortunately no single standard for stopwords). What is really needed is versioned text indexes tied to the stopword list, along with support for user-defined stopword lists. See SERVER-10062. We need to put this pull request on hold until that issue is resolved. The problem is that if a word is removed from the stop list, documents subsequently added to the collection will have entries in the index for that word while documents indexed earlier will not, which will lead to inconsistent search results. Conversely, if a word is added to the stop list and a document containing that term is then deleted, it will leave orphan entries in the index. |
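
To make the two failure modes above concrete, here is a minimal sketch in Python. It is a toy inverted index, not MongoDB's text index implementation: `ToyTextIndex` and both `demo_*` functions are hypothetical names, the tokenizer is a plain whitespace split, and "aucun" is used only as a sample word. The one assumption carried over from the comment is that the stop-word list is applied both when a document is indexed and when it is unindexed.

```python
class ToyTextIndex:
    """Toy inverted index (hypothetical; not MongoDB's implementation)."""

    def __init__(self, stop_words):
        self.stop_words = set(stop_words)
        self.postings = {}  # term -> set of document ids

    def _terms(self, text):
        # Stop words are filtered with whatever list is active *right now*.
        return {w for w in text.lower().split() if w not in self.stop_words}

    def add(self, doc_id, text):
        for term in self._terms(text):
            self.postings.setdefault(term, set()).add(doc_id)

    def remove(self, doc_id, text):
        # Only terms that survive the current stop list are unindexed.
        for term in self._terms(text):
            self.postings.get(term, set()).discard(doc_id)

    def search(self, word):
        return sorted(self.postings.get(word.lower(), set()))


def demo_removed_from_stop_list():
    index = ToyTextIndex(stop_words={"aucun"})
    index.add(1, "aucun probleme")     # "aucun" is skipped for doc 1
    index.stop_words.discard("aucun")  # the stop list changes under the index
    index.add(2, "aucun document")     # "aucun" is indexed for doc 2
    return index.search("aucun")       # [2]: doc 1 also contains the word


def demo_added_to_stop_list():
    index = ToyTextIndex(stop_words=set())
    index.add(3, "aucun fichier")      # both terms are indexed
    index.stop_words.add("aucun")      # the stop list changes under the index
    index.remove(3, "aucun fichier")   # only "fichier" is unindexed
    return index.postings              # "aucun" still points at deleted doc 3


if __name__ == "__main__":
    print(demo_removed_from_stop_list())
    print(demo_added_to_stop_list())
```

In this sketch the only way to get back to a consistent state after changing the list is to rebuild the index from scratch, which is the motivation for tying the stop-word list to an index version.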
| Comment by Arthur Darcet [ 12/Feb/14 ] |
|
I created a pull request removing those terms from the list: https://github.com/mongodb/mongo/pull/633 |