[SERVER-29598] Support Korean language in full text search Created: 13/Jun/17  Updated: 27/Dec/23

Status: Backlog
Project: Core Server
Component/s: Text Search
Affects Version/s: None
Fix Version/s: None

Type: New Feature
Priority: Major - P3
Reporter: 아나 하리
Assignee: Backlog - Query Integration
Resolution: Unresolved
Votes: 6
Labels: qi-text-search
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-45859 Text Indexes with partial word match ... Closed
Assigned Teams:
Query Integration
Participants:
Case:

 Description   

Add Korean to languages supported in MongoDB FTS.

Original description:
First of all, MongoDB supports stemming for major languages such as English.
But there is no stemming for CJK languages (I am focusing on Korean in particular). So MongoDB text search is useless for Korean unless the application code stems the Korean text itself.

I am not sure whether you are interested in Korean.
In any case, Korean only attaches suffixes (postpositional words) after the stem (base word), like this:

Stem : 한글
With suffix : 한글은, 한글이, 한글을, 한글과, 한글도, 한글처럼, ...

But in the current implementation, MongoDB text search matches only the exact search term. So a Korean word does not match because of its suffix ("은", "는", "이", "가", "처럼", ...).
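To illustrate what "stemming Korean in application code" could look like, here is a minimal Python sketch that strips one trailing suffix from a token before it is stored or searched. The function name `strip_particle` and the `PARTICLES` tuple are hypothetical, and the suffix list contains only the suffixes quoted in this ticket. It is a naive illustration, not a real Korean stemmer: it would also truncate words that merely happen to end in one of these characters.

```python
# Suffixes (postpositions) taken from the examples in this ticket only.
PARTICLES = ("처럼", "은", "는", "이", "가", "을", "과", "도")


def strip_particle(token):
    """Remove one trailing suffix from token, trying longer suffixes first."""
    for particle in sorted(PARTICLES, key=len, reverse=True):
        # Never strip the whole token: a stem must remain.
        if token.endswith(particle) and len(token) > len(particle):
            return token[: -len(particle)]
    return token


words = ["한글은", "한글이", "한글을", "한글과", "한글도", "한글처럼"]
stems = [strip_particle(w) for w in words]
print(stems)
```

Every word in the example list reduces to the stem "한글", so an exact-match search for "한글" would then find them.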

So if MongoDB supported a range search for text search, as in the example below, we (Koreans) could use text search for the Korean language.

Text : "한글은 뛰어난 언어입니다."
Search term : "한글"
Range search in text search : "한글" <= range < "한긁"
  (where "한긁" is generated by simply incrementing the last character of the search term, [like this|https://github.com/mongodb/mongo/pull/1151/commits/641c3041282746aff280b685424d55926bab93b2#diff-bc6db30f2a5f9618496534d03aeabf54R108])
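The range idea above can be sketched in a few lines of Python. This is only an illustration of the bounds computation, not the code from the pull request: `prefix_range` and `range_match` are hypothetical names, tokens are split on whitespace, and strings are compared by Unicode code point as Python does by default.

```python
def prefix_range(term):
    """Return (lower, upper) so that lower <= token < upper holds for every
    token that starts with term (comparison by Unicode code point)."""
    if not term:
        raise ValueError("search term must not be empty")
    last = ord(term[-1])
    if last >= 0x10FFFF:
        raise ValueError("last character cannot be incremented")
    # Upper bound: same term with its last character incremented by one.
    return term, term[:-1] + chr(last + 1)


def range_match(tokens, term):
    """Return the tokens that fall inside the half-open range for term."""
    lower, upper = prefix_range(term)
    return [t for t in tokens if lower <= t < upper]


tokens = "한글은 뛰어난 언어입니다.".split()
bounds = prefix_range("한글")
matches = range_match(tokens, "한글")
print(bounds)   # ('한글', '한긁')
print(matches)  # ['한글은']
```

Here "글" is U+AE00 and "긁" is U+AE01, so the token "한글은" sorts inside the range while the other tokens in the sentence do not.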

Of course, this feature is not needed for languages that have stemming.
So I would like you to add a knob to enable or disable this range search for text search (disabled by default). Then we can use text search with this knob set to true for the Korean language.

I have pushed a pull request for this simple idea to the MongoDB GitHub repository.

This feature would help a lot of Korean users. Please seriously consider adding it.
(I am not sure whether this feature would be useful for Japanese or Chinese, which do not put spaces between words.)

Thanks.



 Comments   
Comment by 아나 하리 [ 30/Jun/17 ]

Hi Asya.

>> I'm going to convert this ticket into a new feature request for MongoDB to add proper text search support for Korean language.
Sure. I only added some simple code to explain how MongoDB could support Korean full-text search without stemming, and my pull request is not a complete patch.

Anyway, I hope "SERVER-15090" is implemented sooner or later.

Thanks.

Comment by Asya Kamsky [ 28/Jun/17 ]

matt.lee,

You are correct, MongoDB text search currently does not provide support for Korean (you can see the list of currently supported languages here).

The best solution would be for us to add support for Korean, which would include support for appropriate stemming and stop words. As you found, if the language is not supported, text search uses simple tokenization with no list of stop words and no stemming.

Your proposed pull request tries to implement prefix text search, a new feature we are already tracking in SERVER-15090. However, we cannot accept the pull request for several reasons:

  • there are no tests included, so there is no way to make sure that the changes didn't break existing functionality
  • text indexes can be part of compound indexes and the proposed changes don't look like they would work correctly with a compound index
  • please see our contributor guidelines for other requirements, like contributor agreement, coding style, etc.

Since we already have a JIRA ticket for prefix search, I think the proposed work for that feature should be tracked there. I'm going to convert this ticket into a new feature request for MongoDB to add proper text search support for Korean language.

Thanks for your interest in MongoDB.

Regards,
Asya Kamsky
Lead Product Manager, MongoDB Server

Comment by Mark Agarunov [ 16/Jun/17 ]

Hello matt.lee,

Thank you for providing the detailed example. I've set the fixVersion on this ticket to "Needs Triage" for this new feature to be scheduled against our currently planned work. Updates will be posted on this ticket as they happen.

Thanks,
Mark

Generated at Thu Feb 08 04:21:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.