[SERVER-20256] Korean Language Created: 02/Sep/15  Updated: 19/Nov/15  Resolved: 19/Nov/15

Status: Closed
Project: Core Server
Component/s: Text Search
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Siamak Assignee: Kelsey Schubert
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: JPEG File Mong-ko.jpg    
Operating System: ALL
Participants:

 Description   

Dear Sir/Madam,

I know that you have fixed this problem, but I still have a problem with Unicode for the Korean language. I am trying to import a Korean Wikipedia corpus into MongoDB on Linux, but when I search for a word in MongoDB through my Java application, no matching word is found. What do I have to do? I tried converting the corpus to UTF-8 in my query and in mongo, but the results were the same.



 Comments   
Comment by Kelsey Schubert [ 19/Nov/15 ]

Hi 30yamak,

Sorry for the long delay getting back to you. I have imported your data and successfully queried the term field. Please see the examples below:

db.terms.findOne({term : {$regex : "지수적"}})
{
	"_id" : ObjectId("55e8a06cf02a8168ed428d2f"),
	"term" : "{\"term\":\"지수적\",\"vector\":{\"197761\":0.036434002220630646,\"296370\":0.04846245050430298,\"237083\":0.010118533857166767,\"57801\":0.1235201507806778,\"300077\":0.055651936680078506,\"62474\":0.007019163109362125,\"300030\":0.2067071944475174,\"165881\":0.011536050587892532,\"31741\":0.002140911528840661,\"238158\":0.05690254271030426,\"244254\":0.18086878955364227}}",
	"vector" : BinData(0,"AAAAAA==")
}
 
db.terms.findOne({term : {$regex : "\u110c\u1175\u1109\u116e\u110c\u1165\u11A8"}})
{
	"_id" : ObjectId("55e8a06cf02a8168ed428d2f"),
	"term" : "{\"term\":\"지수적\",\"vector\":{\"197761\":0.036434002220630646,\"296370\":0.04846245050430298,\"237083\":0.010118533857166767,\"57801\":0.1235201507806778,\"300077\":0.055651936680078506,\"62474\":0.007019163109362125,\"300030\":0.2067071944475174,\"165881\":0.011536050587892532,\"31741\":0.002140911528840661,\"238158\":0.05690254271030426,\"244254\":0.18086878955364227}}",
	"vector" : BinData(0,"AAAAAA==")
}

It's worth noting that some fonts may render two different symbols as a single character. Depending on your font, these two symbols may appear the same: 지 지. However, one of them is encoded as a sequence of two Unicode code points (conjoining jamo), whereas the other is a single precomposed code point. The code points stored in the document must match the code points in the query.
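
As a small illustration (the class name below is just for this example), the following Java sketch shows that the precomposed syllable and the decomposed jamo sequence compare as different strings, and that NFC normalization with the JDK's java.text.Normalizer maps them to the same form:

import java.text.Normalizer;

public class HangulFormsDemo {
    public static void main(String[] args) {
        String precomposed = "\uC9C0";        // single precomposed syllable 지 (U+C9C0)
        String decomposed  = "\u110C\u1175";  // conjoining jamo ᄌ + ᅵ, usually rendered identically

        System.out.println(precomposed.equals(decomposed));  // false: different code points
        String a = Normalizer.normalize(precomposed, Normalizer.Form.NFC);
        String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(a.equals(b));                      // true: NFC composes the jamo into U+C9C0
    }
}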

I am closing this ticket since we can't reproduce this issue. If you can share a runnable reproduction script, preferably in JavaScript, we'll be happy to take another look.

Thank you,
Thomas

Comment by Siamak [ 25/Sep/15 ]

No new news?

Comment by Siamak [ 17/Sep/15 ]

Dear Sam, I attached the mongodump to my Dropbox, because its size was more than the upload allowance here.

https://dl.dropboxusercontent.com/u/6149013/terms.bson.gz
https://dl.dropboxusercontent.com/u/6149013/terms.metadata.json.gz
https://dl.dropboxusercontent.com/u/6149013/system.indexes.bson.gz

With Best Wishes,
Siamak

Comment by Sam Kleinman (Inactive) [ 16/Sep/15 ]

Can you provide some of your data in the form of a mongodump .bson file? This will allow me to try my reproduction with your data.

Regards,
sam

Comment by Siamak [ 14/Sep/15 ]

Dear Sam, thank you for your help.
About your questions:
1. Yes, I tried it, but the result was the same: NOTHING.
2. I tried two versions of the driver, 2.12.4 and 3.0.3, and for both versions the result was the same.
3. I tried it; the result was the same.
4. No, unfortunately. (I inserted with 2.12.4 and tried to retrieve with both 2.12.4 and 3.0.3.)
5. I tried several strings; the results were the same.

I am attaching my MongoDB data.
With Best Wishes,
Siamak

Comment by Sam Kleinman (Inactive) [ 11/Sep/15 ]

Sorry for not getting back to you sooner.

I've been trying to reproduce this issue with the mongo shell, without luck. You can see my attempt to translate your example here:

(function() {
    "use strict";

    var coll = db.getCollection('testColl');

    coll.drop();
    assert.eq(0, coll.count());

    // Korean test string, written as precomposed Hangul syllables
    var ustr = "\uC815\uC2E0\uBCD1\uC6D0";
    coll.insert({"_id": "one", "data": ustr});

    assert.eq(1, coll.count());
    assert.eq(1, coll.count({"data": ustr}));
}());

I have some more questions about your issue:

  1. Are you able to reproduce your problem in the mongo shell?
  2. Which driver are you using where you see this issue? Which version of that driver are you using?
  3. Are you able to reproduce this issue with another driver?
  4. If you insert a document with your driver, can you successfully retrieve it with a different client?
  5. Do all of the Korean strings exhibit this error, or is it only some of them? Are you able to use strings pulled from other sources?

Thanks again for your help.

Regards,
sam

Comment by Siamak [ 02/Sep/15 ]

I also tried to find a specific string through \uXXXX escapes, but the return value was NULL again:

String original = "\uC815\uC2E0\uBCD1\uC6D0";
searchQuery.put("term", original);
DBCursor cursor1 = collection.find(searchQuery);
while (cursor1.hasNext()) {
    System.out.println(cursor1.next());
}
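
One way to pin down this kind of mismatch, sketched below under the assumption that collection and original are the variables from the snippet above (dumpCodePoints is a hypothetical debugging helper), is to print the code points of the query string next to those of a term actually stored in the collection:

// Hypothetical debugging helper: print each code point of a string as U+XXXX
static void dumpCodePoints(String label, String s) {
    StringBuilder sb = new StringBuilder(label + ":");
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        sb.append(String.format(" U+%04X", cp));
        i += Character.charCount(cp);
    }
    System.out.println(sb);
}

// Compare the query value with the term of an arbitrary stored document
DBObject sample = collection.findOne();
dumpCodePoints("query ", original);
dumpCodePoints("stored", sample.get("term").toString());
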
Comment by Siamak [ 02/Sep/15 ]

I have a problem with Unicode for the Korean language in MongoDB. I am trying to import a Korean Wikipedia corpus into MongoDB on Linux, but when I search for a word in MongoDB through my Java application, no matching word is found. What do I have to do? I tried converting the corpus to UTF-8 in my query and in mongo, but the results were the same.

Here is my code, where I convert the string to UTF-8 when I insert my data into MongoDB and when I query it back:

// For inserting into MongoDB:
byte[] utf8Bytes = term.get("term").toString().getBytes("UTF-8");
dbDao.insertVector(new String(utf8Bytes, "UTF-8"), parseVector((Map) term.get("vector")));

// For finding in MongoDB:
byte[] utf8Bytes = term.get("term").toString().getBytes("UTF-8");
DBObject dbo = termsCollection.findOne(new BasicDBObject(TERM, new String(utf8Bytes, "UTF-8")));
// The return value is always null!
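
For well-formed text, the getBytes("UTF-8") / new String(..., "UTF-8") round trip returns an equal string, so it does not change which code points are sent to the server. If the code-point mismatch Kelsey describes above is the cause, one possible approach, sketched here only as an assumption (the class and method names are made up; "term" matches the field shown in the dump, and DBCollection/BasicDBObject are the legacy driver types used elsewhere in this ticket), is to normalize to NFC on both the insert and the query path:

import java.text.Normalizer;

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

// Hypothetical helper: keep stored terms and query terms in one canonical Unicode form (NFC)
public class NfcTerms {

    // Normalize before writing, so stored terms all use the composed form
    public static void insertTerm(DBCollection termsCollection, String term) {
        String nfc = Normalizer.normalize(term, Normalizer.Form.NFC);
        termsCollection.insert(new BasicDBObject("term", nfc));
    }

    // Normalize the query value the same way, so lookups compare identical code points
    public static DBObject findTerm(DBCollection termsCollection, String term) {
        String nfc = Normalizer.normalize(term, Normalizer.Form.NFC);
        return termsCollection.findOne(new BasicDBObject("term", nfc));
    }
}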
