[CSHARP-4842] String field starting with accented character can't be found by concatenated LINQ. Created: 13/Nov/23  Updated: 22/Nov/23  Resolved: 22/Nov/23

Status: Closed
Project: C# Driver
Component/s: Linq
Affects Version/s: 2.22.0
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Takács Róbert Assignee: Oleksandr Poliakov
Resolution: Works as Designed Votes: 0
Labels: accented, linq, query
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Summary

When a stored string that starts with an accented character is searched for with the same word, the query does not return the matching documents.

The issue only occurs when the stored value and the search term both start with an accented character.

Please provide the version of the driver. If applicable, please provide the MongoDB server version and topology (standalone, replica set, or sharded cluster).

MongoDB Driver - 2.22

MongoDB server - 5.0.10

LINQ - 4.3.0

.NET 7.0

How to Reproduce

In a collection there are documents with a Name property, which is a string.

Some of the documents have their Name set to "Ügyfél". The important thing here is the first character, which is accented.

When I try to find the documents that have "Ügyfél" in their Name:

string[] names = new string[] { "Ügyfél" };
return collection.AsQueryable().Where( c => names.Any( n => c.Name.Contains( n ) ) );

The result is an empty list. - Doesn't work.

 

When I try to find the documents that have "gyfél" in their Name:

string[] names = new string[] { "gyfél" };
return collection.AsQueryable().Where( c => names.Any( n => c.Name.Contains( n ) ) );

The result is all the documents that have "gyfél" in their Name, including the ones with "Ügyfél". - Works fine.

 

When I try to find the documents that have "él" in their Name:

string[] names = new string[] { "él" };
return collection.AsQueryable().Where( c => names.Any( n => c.Name.Contains( n ) ) );

The result is all the documents that have "él" in their Name, including the ones with "Ügyfél". - Works fine.

 

If I build the predicate with a LINQ Expression instead:

ParameterExpression argParam = Expression.Parameter( typeof( EventModel ), "eventmodel" );
Expression result = null;
string[] names = new string[] { "Ügyfél" };
foreach( string name in names )
{
    Expression<Func<EventModel, bool>> anyExpr = eventModel => eventModel.Name.Contains( name );
    anyExpr = Expression.Lambda<Func<EventModel, bool>>( anyExpr.Body.Replace( anyExpr.Parameters[0], argParam ), argParam );
    if( result == null )
        result = anyExpr.Body;
    else
        result = Expression.AndAlso( result, anyExpr.Body );
}
if( result is not null )
{
    Expression<Func<EventModel, bool>> finalExpression = Expression.Lambda<Func<EventModel, bool>>( result, argParam );
    return source.Where( finalExpression );
}

The result is all the documents that have "Ügyfél" in their Name. - Works fine.

Additional Background

The actual query built by the concatenated LINQ is:

{"$match":{"$expr":{"$anyElementTrue":{"$map":{"input": ["Ügyfél"],"as":"n","in":{"$gte":[{"$indexOfCP": ["$Name","$$n"]},0]}}}}}}

While the Expression LINQ builds the following:

{"$match":{"Name": /Ügyfél/is}}



 Comments   
Comment by Oleksandr Poliakov [ 22/Nov/23 ]

Yes, you are right! MongoDB's $toLower and $toUpper have well-defined behavior only for ASCII characters. A suggestion to support Unicode was created quite a while ago, but never got many votes: https://jira.mongodb.org/browse/SERVER-32141

So there are no issues with the .NET Driver. I'll close the ticket, but feel free to reopen it if you need any further assistance.

 

Thanks,

Oleksandr.

Comment by Takács Róbert [ 21/Nov/23 ]

Hi oleksandr.poliakov@mongodb.com !

 

After further testing I think I was able to find the issue. You were right: it really is about case sensitivity. I had removed that part for the sake of simplicity, which in this case hid the bug.

Keeping our previous testing environment I was able to produce the following:

Putting back the ".ToLower()" part inside the query resulted in a wrong behaviour.

string[] names = new string[] { "Ügyfél" };
string[] loweredNames = names.Select( n => n.ToLower() ).ToArray();
var query = collection.AsQueryable().Where( c => loweredNames.Any( n => c.Name.ToLower().Contains( n ) ) );

The generated query is:

{ "$match" : { "$expr" : { "$anyElementTrue" : { "$map" : { "input" : ["ügyfél"], "as" : "n", "in" : { "$gte" : [{ "$indexOfCP" : [{ "$toLower" : "$Name" }, "$$n"] }, 0] } } } } } }

No Model is returned. - Incorrect: this should have lowercased the stored value in the database, and since my search term also starts with a lowercase letter, the result should have been one Model.

 
Using the ".ToLower()" on the incoming words but not inside the query:

string[] names = new string[] { "Ügyfél" };
string[] loweredNames = names.Select( n => n.ToLower() ).ToArray();
var query = collection.AsQueryable().Where( c => loweredNames.Any( n => c.Name.Contains( n ) ) ); 

results as:

{ "$match" : { "$expr" : { "$anyElementTrue" : { "$map" : { "input" : ["ügyfél"], "as" : "n", "in" : { "$gte" : [{ "$indexOfCP" : ["$Name", "$$n"] }, 0] } } } } } } 

No Model is returned. - Correct.

 

Using the ".ToLower()" inside the query but not on the incoming words:

string[] names = new string[] { "Ügyfél" };
//string[] loweredNames = names.Select( n => n.ToLower() ).ToArray();
var query = collection.AsQueryable().Where( c => names.Any( n => c.Name.ToLower().Contains( n ) ) ); 

results as:

{ "$match" : { "$expr" : { "$anyElementTrue" : { "$map" : { "input" : ["Ügyfél"], "as" : "n", "in" : { "$gte" : [{ "$indexOfCP" : [{ "$toLower" : "$Name" }, "$$n"] }, 0] } } } } } } 

One Model is returned. - Incorrect: this should have lowercased the stored value in the database, and since my search term starts with an uppercase letter, the result should have been an empty list.

Running the incorrect method but with only a part of the searching word - keeping the uppercase first letter:

string[] names = new string[] { "Fél" };
string[] loweredNames = names.Select( n => n.ToLower() ).ToArray();
var query = collection.AsQueryable().Where( c => loweredNames.Any( n => c.Name.ToLower().Contains( n ) ) );

results as:

{ "$match" : { "$expr" : { "$anyElementTrue" : { "$map" : { "input" : ["fél"], "as" : "n", "in" : { "$gte" : [{ "$indexOfCP" : [{ "$toLower" : "$Name" }, "$$n"] }, 0] } } } } } }

One Model is returned. - Correct.

 
Changing the name "Ügyfél" in the database to "ügyfÉl" and running the following:

string[] names = new string[] { "fÉl" };
string[] loweredNames = names.Select( n => n.ToLower() ).ToArray();
var query = collection.AsQueryable().Where( c => loweredNames.Any( n => c.Name.ToLower().Contains( n ) ) ); 

results as:

{ "$match" : { "$expr" : { "$anyElementTrue" : { "$map" : { "input" : ["fél"], "as" : "n", "in" : { "$gte" : [{ "$indexOfCP" : [{ "$toLower" : "$Name" }, "$$n"] }, 0] } } } } } } 

No Model is returned. - Incorrect: it should have found one Model.

 

Now change "ügyfÉl" to "üGyfél" in the database and run the following:

string[] names = new string[] { "Ügyfél" };
string[] loweredNames = names.Select( n => n.ToLower() ).ToArray();
var query = collection.AsQueryable().Where( c => loweredNames.Any( n => c.Name.ToLower().Contains( n ) ) ); 

results as:

{ "$match" : { "$expr" : { "$anyElementTrue" : { "$map" : { "input" : ["ügyfél"], "as" : "n", "in" : { "$gte" : [{ "$indexOfCP" : [{ "$toLower" : "$Name" }, "$$n"] }, 0] } } } } } } 

One Model is returned. - Correct.

I'm assuming that the "$toLower" part inside the generated query doesn't actually do anything with accented characters inside the database.

Edit - The same happens by using ".ToUpper()".

Edit2 - After reading about this in the documentation I found out that $toLower only applies to ASCII characters, so I guess this isn't an issue at all. One last question about this: are you going to support non-ASCII (UTF-8) characters for this feature some day?
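
The results listed in this comment are consistent with ASCII-only case conversion. As a rough illustration, the following C# sketch imitates that behavior locally, without a server. ToLowerAsciiOnly and ServerStyleContains are hypothetical helper names introduced here, and the code only approximates the documented behavior of $toLower and $indexOfCP; it is not the server's or the driver's implementation.

```csharp
using System;
using System.Linq;

public static class AsciiCaseDemo
{
    // Hypothetical helper: lowercases only 'A'..'Z' and leaves every other
    // character (including accented letters such as 'Ü' and 'É') untouched.
    public static string ToLowerAsciiOnly(string value) =>
        new string(value.Select(ch => ch >= 'A' && ch <= 'Z' ? (char)(ch + 32) : ch).ToArray());

    // Hypothetical helper: approximates
    // { $gte: [ { $indexOfCP: [ { $toLower: "$Name" }, term ] }, 0 ] }
    // with an ordinal (code point by code point) substring search.
    public static bool ServerStyleContains(string storedName, string searchTerm) =>
        ToLowerAsciiOnly(storedName).IndexOf(searchTerm, StringComparison.Ordinal) >= 0;

    public static void Main()
    {
        // The client lowercases the accented first letter, the simulated server side does not.
        string clientLowered = "Ügyfél".ToLowerInvariant(); // "ügyfél"

        Console.WriteLine(ServerStyleContains("Ügyfél", clientLowered)); // False
        Console.WriteLine(ServerStyleContains("Ügyfél", "Ügyfél"));      // True
        Console.WriteLine(ServerStyleContains("Ügyfél", "fél"));         // True
        Console.WriteLine(ServerStyleContains("ügyfÉl", "fél"));         // False
        Console.WriteLine(ServerStyleContains("üGyfél", "ügyfél"));      // True
    }
}
```

Each printed value corresponds to one of the cases above: a match succeeds only when the accented characters in the stored value already have the same case as those in the search term, while ASCII letters are matched regardless of their stored case.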

 

Comment by Oleksandr Poliakov [ 17/Nov/23 ]

Hi 89.t.robert@gmail.com !

Could you please try to run your web app locally so we can check whether the problem is related to the hosting environment? Also, could you please confirm whether your test console application uses the same database/collections as the Kubernetes web app you mentioned?

Comment by Takács Róbert [ 17/Nov/23 ]

Hi oleksandr.poliakov@mongodb.com !

Thank you for helping me out.

 

You're right, the queries are case-sensitive. In my production code I run them case-insensitively, but I removed those parts for the sake of simplicity. Unfortunately, the problem does not lie there.

Also I tried to reproduce it in a new database / collection with new data as you requested and there is one clue which I found out recently.

I totally forgot to mention that the whole system (MongoDB and the services) is running on a dockerized Kubernetes platform. I'm still running tests on the issue, and I found out what happens if I connect to this system through a simple C# console app:

string connectionString = "*****";
MongoClient client = new( connectionString );
var database = client.GetDatabase( "model-db" );
var collection = database.GetCollection<Model>( "models" );
//var model1 = new Model
//{
//    Name = "Ügyfél"
//};
//var model2 = new Model
//{
//    Name = "Felhasználó"
//};
//collection.InsertOne( model1 );
//collection.InsertOne( model1 );
string[] names = new string[] { "Új" };
var query = collection.AsQueryable().Where( c => names.Any( n => c.Name.Contains( n ) ) );
var result = query.ToArray(); 

The result contains only 1 document, the correct one. Still, when I call through my app's APIs (either locally or as a web app in Azure, both running in Kubernetes), the problem persists.

When I call either app with Postman, the problem also persists. With our devops team we'll try to find out whether a character set setting is behind the issue and come back with results. I'll also provide details about our Docker / Kubernetes settings.

 

Again thank you for your help!

 

Comment by Oleksandr Poliakov [ 16/Nov/23 ]

Hi 89.t.robert@gmail.com !

I cannot reproduce the issue. I've tried the provided code with the following model:

 

private class Model
{
    public int Id { get; set; }
    public string Name { get; set; }
} 

And initialized the collection with the following data:

 

 

new Model { Id = 1, Name = "Ügyfél" }
new Model { Id = 2, Name = "Other" } 

Everything works as expected. The result contains a single object.

 

 

However, your example for the Expression LINQ generates a case-insensitive regex, which makes me think that it might be related to case sensitivity. Could you please try searching for both "Ügyfél" and "ügyfél"? The next step in investigating would be to try to reproduce the problem on a new database/collection. It would also be helpful if you could provide some test data to reproduce the issue.

 

Thanks,

Oleksandr

Comment by PM Bot [ 13/Nov/23 ]

Hi 89.t.robert@gmail.com, thank you for reporting this issue! The team will look into it and get back to you soon.
