[JAVA-1791] Full text searching a Turkish word using mongodb-java-driver does not work Created: 01/May/15  Updated: 11/Sep/19  Resolved: 01/May/15

Status: Closed
Project: Java Driver
Component/s: None
Affects Version/s: 3.0.0
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Hakan Özler Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

My System settings:
Windows 7 64-bit
6GB RAM DDR3, Intel Core i7 1.73GHz
MongoDB version 3.0
Mongo Java Driver version: mongo-java-driver:3.0.0



 Description   

I have a text index on the "text" field whose default language is Turkish. When I query in the mongo shell, I get a count of 17 using the following script:

> use newspaper
> db.news.getIndices()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_",
                "ns" : "newspaper.news"
        },
        {
                "v" : 1,
                "key" : {
                        "_fts" : "text",
                        "_ftsx" : 1
                },
                "name" : "text_text",
                "ns" : "newspaper.news",
                "default_language" : "turkish",
                "weights" : {
                        "text" : 1
                },
                "language_override" : "language",
                "textIndexVersion" : 2
        }
]
> db.news.find({$text:{$search: "maç"}}).count()
17

When I run the same query against the same database from Java using the MongoDB Java Driver, I get 0 results. Here is the code snippet that I use:

final MongoClient mongoClient = new MongoClient(new MongoClientURI("mongodb://localhost"));
final MongoDatabase newspaper = mongoClient.getDatabase("newspaper");
final MongoCollection<Document> news = newspaper.getCollection("news");
Document textSearch = new Document("$text", new Document("$search", "maç"));
long count = news.count(textSearch);
System.out.println(count);

I found that this happens only when a word contains one of the special Turkish characters that we use day in and day out: "ı, ç, ü, ö, ş, ğ". When I search for a word that contains none of them, say "hafta" (English: "week"), I get the same result in both the mongo shell and Java.



 Comments   
Comment by Jeffrey Yemin [ 01/May/15 ]

No worries. Glad we were able to work through it.

Comment by Hakan Özler [ 01/May/15 ]

Thanks Jeff, I just realised. Sorry for taking your time.

Comment by Jeffrey Yemin [ 01/May/15 ]

In IntelliJ preferences, please try configuring Editor->File Encodings to UTF-8 for your project.

Comment by Hakan Özler [ 01/May/15 ]

I actually run the code in IntelliJ, and when I look at the command it uses to run the file, I see this option: "-Dfile.encoding=windows-1254"
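
To confirm which default encoding a running JVM actually picked up, a small sketch like the one below may help. The class name DefaultCharsetCheck and its describe() method are hypothetical and not part of the driver. It reports the runtime default only; how string literals in the source were decoded is decided at compile time (javac's -encoding option, or the platform default on JDKs of that era when the option is absent).

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Summarize the encoding-related settings visible to this JVM.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Run with and without -Dfile.encoding=... to compare the values.
        System.out.println(describe());
    }
}
```

If the reported charset is not the one the source file was saved in, literals such as "maç" are at risk of being misread unless the compiler is told the encoding explicitly.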

Comment by Jeffrey Yemin [ 01/May/15 ]

Try a character-by-character comparison of "maç" and "ma\u00e7", as this looks like a character encoding issue during compilation. Are you setting the character encoding of your source file with the -encoding option on javac?
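
The character-by-character comparison suggested here could be sketched as follows. This is a minimal, hypothetical helper (the class CodePointDump and its methods are illustrative names, not part of the driver). The misdecode() method simulates one possible failure mode, assuming the source file was saved as UTF-8 but read as windows-1254; the encodings actually involved in this report were not confirmed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class CodePointDump {
    // Render each code point of s as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    // Assumption: simulate UTF-8 bytes being decoded as windows-1254.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1254"));
    }

    public static void main(String[] args) {
        // The \u00e7 escape does not depend on the source-file encoding.
        String expected = "ma\u00e7";
        System.out.println(codePoints(expected));            // U+006D U+0061 U+00E7
        System.out.println(codePoints(misdecode(expected))); // U+006D U+0061 U+00C3 U+00A7
    }
}
```

Passing a literal "maç" from your own source file to codePoints() and getting anything other than U+006D U+0061 U+00E7 would indicate that the compiler read the file with a different encoding than the one it was saved in.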

Comment by Hakan Özler [ 01/May/15 ]

Hi, I now get a result when specifying 'ç' as a Unicode escape. But the last statement throws a NullPointerException.

Comment by Jeffrey Yemin [ 01/May/15 ]

I'm not able to reproduce this with the following test program:

        MongoClient client = new MongoClient();
        MongoCollection<Document> collection = client.getDatabase("test").getCollection("JAVA1791");
        collection.drop();
 
        collection.createIndex(new Document("comments", "text"), new IndexOptions().defaultLanguage("turkish"));
 
        // insert with unicode representation of the character
        collection.insertOne(new Document("_id", 1).append("comments", "this is a ma\u00e7 right?"));
 
        // query with Unicode representation
        Document document = collection.find(new Document("$text", new Document("$search", "ma\u00e7"))).first();
        System.out.println(document.toJson());
 
        // query with literal character representation
        document = collection.find(new Document("$text", new Document("$search", "maç"))).first();
        System.out.println(document.toJson());

which outputs the following:

{ "_id" : 1, "comments" : "this is a maç right?" }
{ "_id" : 1, "comments" : "this is a maç right?" }

Can you reproduce these results?

Generated at Thu Feb 08 08:55:30 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.