[SERVER-28182] Russian stop word list is missing words Created: 03/Mar/17  Updated: 27/Dec/23

Status: Backlog
Project: Core Server
Component/s: Text Search
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Dmitry Ryabtsev Assignee: Backlog - Query Integration
Resolution: Unresolved Votes: 1
Labels: qi-text-search, query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File stop_words_russian.txt    
Issue Links:
Backports
Assigned Teams:
Query Integration
Operating System: ALL
Backport Requested:
v3.4, v3.2
Participants:

 Description   

As noticed by a community user here, our Russian stop word list for full-text search (FTS) is broken: a number of words are missing the characters 'п' (U+043F) and 'ч' (U+0447).



 Comments   
Comment by David Storch [ 22/Mar/17 ]

Hi all,

After looking into this issue, we have decided to defer work on a fix. Our stop word lists affect the contents of persistent text index structures. Although this may not pose a severe issue for query correctness, differing stop word semantics between versions lead to some problems. For instance, suppose that a Russian text index built on version 3.4 is replicated to a secondary on version 3.6. The secondary would apply its own stop word list, resulting in indexes that diverge both logically and physically across the replica set members. If a failover occurred and the 3.6 node became primary, text queries using the index would begin to return different results. Furthermore, our index consistency checks across replica set nodes would fail, so you could view this as a form of index corruption.
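The divergence described above can be illustrated with a toy model. This is only a sketch: the whitespace tokenizer, the `index_terms` helper, and the two small stop word sets are hypothetical stand-ins, not the server's actual tokenizer, stemmer, or full stop word lists.

```python
def index_terms(text, stop_words):
    """Return the set of terms a toy text index would store for `text`."""
    return {tok for tok in text.lower().split() if tok not in stop_words}

# Hypothetical excerpts of the stop word list on two versions:
# the old one holds the broken forms, the new one the corrected forms.
STOP_WORDS_OLD = {"и", "то", "отом"}
STOP_WORDS_NEW = {"и", "что", "потом"}

doc = "что и потом случилось"

terms_old = index_terms(doc, STOP_WORDS_OLD)  # keys built by the older node
terms_new = index_terms(doc, STOP_WORDS_NEW)  # keys built by the newer node

# Terms present on one node but not the other: the two indexes disagree
# about the same document.
diverged = terms_old ^ terms_new
print(sorted(diverged))
```

In this model the older node indexes "что" and "потом" while the newer node drops them, so the same document yields different index entries depending on which node built them.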

In short, the stop words list used for a text index should be viewed as an implicit part of that index's definition. Fixing this issue in a safe fashion would require us to introduce a new textIndexVersion. Although we do already have a mechanism for text index versioning, we have no infrastructure for assigning different versions of stop word lists to different index versions. A fix for this ticket would require implementation of versioned stop word lists.
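One way to picture versioned stop word lists is a lookup keyed by the index's version, so that an existing index keeps the list it was built with. The sketch below is an assumption-laden illustration: the version numbers, the `STOP_WORDS_BY_INDEX_VERSION` table, the `stop_words_for` helper, and the index names are all hypothetical, and the stop word sets are tiny excerpts.

```python
# Hypothetical mapping from text index version to the stop word list
# that indexes of that version were built with.
STOP_WORDS_BY_INDEX_VERSION = {
    3: frozenset({"и", "то", "отом"}),    # illustrative: list with broken entries
    4: frozenset({"и", "что", "потом"}),  # illustrative: corrected list
}

def stop_words_for(index_spec):
    """Pick the stop word list from the index definition, not the binary version."""
    version = index_spec["textIndexVersion"]
    try:
        return STOP_WORDS_BY_INDEX_VERSION[version]
    except KeyError:
        raise ValueError(f"unsupported textIndexVersion: {version}") from None

existing_index = {"name": "body_text", "textIndexVersion": 3}
new_index = {"name": "body_text_v4", "textIndexVersion": 4}

# Every node resolves the same list for the same index definition,
# regardless of which server version it runs.
print(sorted(stop_words_for(existing_index)))
print(sorted(stop_words_for(new_index)))
```

Under this scheme the corrected list only takes effect for indexes created with the newer version, which is why the fix would need a new index version rather than an in-place change to the list.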

Given this complexity, I am moving the fixVersion of this ticket to "Backlog".

Best,
Dave

Comment by Asya Kamsky [ 15/Mar/17 ]

Once this is fixed, we should consider it for backport to 3.4.

Comment by Dmitry Ryabtsev [ 03/Mar/17 ]

All of the broken stop words that I was able to identify (broken form -> intended word):

...
вроем  -> втроем (all three)?
...
его
его -> чего (what)
ее   
ее -> еще (more) - that is the only place where it seems that 'щ' (U+0449) is missing
...
еловек -> человек (human)
еред -> перед (in front of)
ерез -> через (over)
...
заем -> зачем (why); as written, заем is itself a real word meaning "loan"
...
конено -> конечно (sure)
...
ниего -> ничего (nothing)
...
о
о -> по (by/on)
...
осле -> после (after)
...
оти -> почти (almost)
отом -> потом (later/after)
отому -> потому (because)
оять -> опять (again)
...
ри -> при (at)
ро -> про (about)
...
сейас -> сейчас (now)
...
теерь -> теперь (now)
...
то
то -> что (what/that)
тоб -> чтоб (to)
тобы -> чтобы (to/that)
...
уть -> чуть (a little)
...
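The pairs above can be cross-checked mechanically. The sketch below assumes the broken forms arise from dropping the characters 'п', 'ч' and 'щ' from the intended words; the pairs are copied from the list, while `strip_dropped` and the variable names are illustrative. It is a consistency check on the list, not a claim about how the file was actually corrupted.

```python
# Characters reported as missing: 'п' (U+043F), 'ч' (U+0447), 'щ' (U+0449).
DROPPED = "пчщ"

# intended word -> broken form, as listed in the comment above
PAIRS = {
    "втроем": "вроем", "чего": "его", "еще": "ее", "человек": "еловек",
    "перед": "еред", "через": "ерез", "зачем": "заем", "конечно": "конено",
    "ничего": "ниего", "по": "о", "после": "осле", "почти": "оти",
    "потом": "отом", "потому": "отому", "опять": "оять", "при": "ри",
    "про": "ро", "сейчас": "сейас", "теперь": "теерь", "что": "то",
    "чтоб": "тоб", "чтобы": "тобы", "чуть": "уть",
}

def strip_dropped(word):
    """Remove every character in DROPPED from `word`."""
    return "".join(ch for ch in word if ch not in DROPPED)

# Pairs that are NOT explained by dropping the three characters.
mismatches = {
    intended: broken
    for intended, broken in PAIRS.items()
    if strip_dropped(intended) != broken
}
print(mismatches)
```

Every pair except вроем -> втроем is reproduced by the character drop; that entry is missing 'т' instead, which is consistent with the question mark next to it in the list.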

Generated at Thu Feb 08 04:17:22 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.