[SERVER-9953] Text search: dutch stemmer not working? Created: 18/Jun/13  Updated: 28/Dec/23

Status: Backlog
Project: Core Server
Component/s: Text Search
Affects Version/s: 2.4.4
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Erik Pragt Assignee: Backlog - Query Integration
Resolution: Unresolved Votes: 0
Labels: qi-text-search, query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-9537 Full text search in Dutch does incorr... Closed
Assigned Teams:
Query Integration
Participants:

 Description   

Hi all,

I'm using MongoDB text search, and I'd like to give some feedback. I'm not sure what the best channel for that is, so I've filed this report. If there's a preferred way, please let me know so I can use it in the future.

Based on this document: http://docs.mongodb.org/manual/tutorial/create-text-index-on-multi-language-collection/, I've put together a test case, and I don't understand the results.

This is my test data:

{ "_id" : 1, "language" : "portuguese", "quote" : "A sorte protege os audazes" }
{ "_id" : 2, "language" : "spanish", "quote" : "Nada hay más surreal que la realidad." }
{ "_id" : 3, "language" : "english", "quote" : "is this a dagger which I see before me" }
{ "_id" : 4, "language" : "dutch", "quote" : "is dit een dolk die ik voor mij zie" }
{ "_id" : 5, "language" : "dutch", "quote" : "vol verbijstering zaten de dames naar de twee honden te kijken" }

And I'm most interested in finding the Dutch results right now.

It seems like the stemmer is not working for some words:

> db.quotes.runCommand( "text", { search: "honden", language:"dutch" } )
Correct result: 1 (queryDebugString: 'hond')
> db.quotes.runCommand( "text", { search: "hond", language:"dutch" } )
Correct result: 1 (queryDebugString: 'hond')
> db.quotes.runCommand( "text", { search: "dames", language:"dutch" } )
Correct result: 1 (queryDebugString: 'dames')
> db.quotes.runCommand( "text", { search: "dame", language:"dutch" } )
Incorrect result: 0 (queryDebugString: 'dam')

Note that the plural of hond ('dog') is honden ('dogs').
The plural of dame ('lady') is dames ('ladies').

However, MongoDB text search doesn't seem to understand this for dame/dames: 'dame' is stemmed to 'dam' while 'dames' is left as 'dames', so searching for 'dame' returns nothing. This looks like a bug to me.
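
The mismatch can be made concrete with a small sketch. It does not run a real stemmer: the stems are copied from the queryDebugString values quoted above, and matchesIndexedWord is a hypothetical helper that assumes a search term finds an indexed word only when both reduce to the same stem.

```javascript
// Stems as reported by queryDebugString in this ticket. These are observed
// values copied from the report, not the output of a stemmer run here.
const observedStems = {
  honden: "hond",
  hond: "hond",
  dames: "dames",
  dame: "dam",
};

// Hypothetical helper (assumption): a search term matches an indexed word
// only when both reduce to exactly the same stem.
function matchesIndexedWord(searchTerm, indexedWord) {
  return observedStems[searchTerm] === observedStems[indexedWord];
}

const results = {
  // "hond" and "honden" share the stem "hond", so the search succeeds.
  hondVsHonden: matchesIndexedWord("hond", "honden"),
  // "dame" becomes "dam" but "dames" stays "dames", so the search fails.
  dameVsDames: matchesIndexedWord("dame", "dames"),
};

console.log(results);
```

Under that assumption, hondVsHonden is true and dameVsDames is false, which lines up with the result counts of 1 and 0 reported above.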



 Comments   
Comment by Miguel G [ 16/Nov/13 ]

Hi there

I am also having issues with the Spanish text search:

The stemmer apparently removes the 'o' at the end of each word (we have quite a few words that end in 'o', so you can see how problematic this is).

So if I run this query:

db.collection.runCommand( "text", { search: "barco", language:"spanish" } )

I get the following output, and no results even though there's a field containing the word 'barco' (notice how the 'o' has been removed in the queryDebugString field):

{
    "queryDebugString" : "barc||||||",
    "language" : "spanish",
    "results" : [ ],
    "stats" : {
        "nscanned" : 0,
        "nscannedObjects" : 0,
        "n" : 0,
        "nfound" : 0,
        "timeMicros" : 1208
    },
    "ok" : 1
}

But if I run the same query with English as the language:

db.collection.runCommand( "text", { search: "barco", language:"english" } )

I get a result (notice that the 'o' has not been removed this time):

{
    "queryDebugString" : "barco||||||",
    "language" : "english",
    "results" : [
        {
            "score" : 1.1,
            "obj" : {
                "_id" : ObjectId("527822523dd360464b4fd1d7"),
                ...
}

Any idea why the 'o' is being removed in Spanish?

Many thanks
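
One possible reading of the two outputs above, which the comment itself does not confirm, is that the documents were indexed with English stemming while the first query was stemmed as Spanish. The sketch below models only that hypothesis: the stems are the two queryDebugString values quoted in the comment, and wouldMatch is a hypothetical helper, not a MongoDB API.

```javascript
// Stems taken from the two queryDebugString values quoted in the comment
// ("barc" for spanish, "barco" for english). No stemmer is run here.
const stemsByLanguage = {
  spanish: { barco: "barc" },
  english: { barco: "barco" },
};

// Hypothetical model (assumption): the index stores the stem computed with
// the index language, and a query matches only if the stem computed with the
// query language is identical.
function wouldMatch(word, indexLanguage, queryLanguage) {
  return stemsByLanguage[indexLanguage][word] === stemsByLanguage[queryLanguage][word];
}

// ASSUMPTION: the index was built with English stemming. The comment does
// not say which language the index used.
const spanishQueryMatches = wouldMatch("barco", "english", "spanish");
const englishQueryMatches = wouldMatch("barco", "english", "english");

console.log({ spanishQueryMatches, englishQueryMatches });
```

Under this assumption the Spanish query misses ("barc" is not "barco") and the English query hits, which is consistent with the empty and non-empty result sets shown above. If the index had instead been built with Spanish stemming, the same model predicts the Spanish query would match.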

Comment by Amalia Hawkins [ 07/Nov/13 ]

Hi, Erik. Sorry for the delay in response! Since this is an issue with the stemmer we use, the solution is dependent on (a) Snowball modifying the stemmer, or (b) MongoDB text search switching to a new stemmer. We will keep it in mind as we move forward with the feature. Thank you for your help!

Comment by Daniel Pasette (Inactive) [ 25/Jun/13 ]

We need to check whether stemmer issues can be reported upstream to Snowball, the stemmer used by text search.

Generated at Thu Feb 08 03:21:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.