[JAVA-4431] Driver allows inserting invalid UTF-8 strings Created: 21/Dec/21  Updated: 10/Oct/22  Resolved: 10/Oct/22

Status: Closed
Project: Java Driver
Component/s: BSON
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Stephen Machnowski Assignee: Unassigned
Resolution: Won't Fix Votes: 0
Labels: external-user
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-62871 [4.4] Improve handling of text index ... Closed
is related to SERVER-62348 Text index creation fails with error ... Closed
Case:

 Description   

Summary

It is possible to create a Java String containing char sequences (for example, an unpaired surrogate) that, when encoded by the MongoDB Java Driver, are stored in a MongoDB string field as invalid UTF-8.

The BSON spec defines the string type as UTF-8, which I think implies valid UTF-8 (but I could be wrong). When invalid UTF-8 is stored in a string field, a text index cannot be created on that field because MongoDB throws an exception. It is then also difficult to find and fix the affected fields.

The behavior differs from the mongo shell and other drivers, which validate the UTF-8 before persistence and use replacement characters such as '?' or U+FFFD to ensure the DB only contains valid UTF-8 strings.

Please provide the version of the driver. If applicable, please provide the MongoDB server version and topology (standalone, replica set, or sharded cluster).

Replicable with Mongo Java Drivers: 4.3.x, 4.4.0

MongoDB Server: 4.4.10 (crashes replica set trying to create text index on the field)

MongoDB Server: 5.0.4 (fails with exception trying to create text index on the field)

How to Reproduce

Create a String in Java from random bytes or by truncating a String that contains emojis. Insert it into a collection, then try to create a text index on that field.


import java.nio.charset.StandardCharsets;
import java.util.List;
import org.bson.Document;
import com.mongodb.MongoClientSettings;
import com.mongodb.MongoCommandException;
import com.mongodb.MongoCredential;
import com.mongodb.ServerAddress;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
public class CreateInvalidUTF8 {
    
    public static void main(String[] args) {
        String fname = "Firstname";
        String lname = "🐎🐎🐎🐎";
        
        MongoClient mongoCli = MongoClients.create(MongoClientSettings.builder()
                .applyToClusterSettings(builder ->
                        builder.hosts(List.of(new ServerAddress(args[0]))))
                .credential(MongoCredential.createCredential(args[1], "admin", args[2].toCharArray()))
                .build());

        MongoCollection<Document> coll = mongoCli.getDatabase("test_utf8").getCollection("foo");
        MongoCollection<Document> coll2 = mongoCli.getDatabase("test_utf8").getCollection("bar");
        
        Document d = new Document();
        // charAt(0) returns only the high surrogate of the first emoji, leaving an
        // unpaired surrogate. This is a common mistake in Java code that takes user input.
        // But it could also be a String created from any byte array
        // containing random sequences that cannot be encoded
        d.put("name", fname + " " + lname.charAt(0));
        coll.insertOne(d);
        
        // fails with MongoCommandException "text contains invalid UTF-8"
        try {
            coll.createIndex(new Document("name", "text"));
        } catch (MongoCommandException e) {
            e.printStackTrace();
        }
        
        Document d2 = new Document();
        // String.getBytes(UTF_8) replaces the unpaired surrogate with '?',
        // so the round-tripped String encodes to valid UTF-8
        d2.put("name", new String((fname + " " + lname.charAt(0)).getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));
        coll2.insertOne(d2);
        
        // succeeds
        coll2.createIndex(new Document("name", "text"));
        
    }
    
}
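
To see why the first insert stores invalid UTF-8 and the second does not, the string handling can be isolated from MongoDB. The following is a minimal sketch rather than part of the original reproduction; the class and method names (LoneSurrogateDemo, truncatedName, roundTrip) are hypothetical, and the emoji is written as escaped surrogates so the source file encoding does not matter.

```java
import java.nio.charset.StandardCharsets;

final class LoneSurrogateDemo {

    // Mirrors the reproduction: keep only the first char of the last name.
    static String truncatedName(String fname, String lname) {
        return fname + " " + lname.charAt(0);
    }

    // Round-trips through String.getBytes(UTF_8), as the second insert does.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two horse emojis (U+1F40E), each stored by Java as a surrogate pair.
        String lname = "\uD83D\uDC0E\uD83D\uDC0E";
        String name = truncatedName("Firstname", lname);

        // The last char is a high surrogate with no matching low surrogate.
        char last = name.charAt(name.length() - 1);
        System.out.println(Character.isHighSurrogate(last)); // true

        // getBytes(UTF_8) substitutes '?' for the unpaired surrogate.
        System.out.println(roundTrip(name)); // Firstname ?
    }
}
```

The unpaired high surrogate is what the driver passes through unchanged, while the round trip through String.getBytes replaces it before the document is built.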


Additional Background

The same test done via the mongo shell succeeds and stores only valid UTF-8 in the DB string fields.

db.baz.insertOne({name:('Firstname ' + '🐎🐎🐎🐎'.substring(0,1))})
db.baz.createIndex({name:'text'})



 Comments   
Comment by Jeffrey Yemin [ 10/Oct/22 ]

Since it's not considered a server error to insert invalid UTF-8, it's not clear that drivers should consider it an error. And it could also be viewed as a backwards-breaking change, as there could be legitimate reasons why an application would be doing this.

Closing as Won't Fix.
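
With this resolution, any check is left to the application. A minimal sketch, assuming the application would rather reject such strings than silently replace characters; Utf8Check and its methods are hypothetical names, not driver API, and only the standard java.nio.charset classes are used:

```java
import java.nio.charset.StandardCharsets;

final class Utf8Check {

    // True when the String can be encoded as UTF-8 without substitution,
    // i.e. it contains no unpaired surrogates.
    static boolean isEncodableAsUtf8(String s) {
        // A fresh encoder per call, since CharsetEncoder is not thread-safe.
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    // Returns the String unchanged, or throws if it would be stored as invalid UTF-8.
    static String requireEncodableAsUtf8(String s) {
        if (!isEncodableAsUtf8(s)) {
            throw new IllegalArgumentException("String contains an unpaired surrogate");
        }
        return s;
    }
}
```

In the reproduction this could wrap the value before it is added to the document, for example d.put("name", Utf8Check.requireEncodableAsUtf8(name)).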

Comment by Esha Bhargava [ 04/Jan/22 ]

This is something we can fix, but there is a cost, and we have to make sure it is acceptable. For now we've decided to put it in the backlog.

Comment by Jeffrey Yemin [ 04/Jan/22 ]

Maybe:

    static boolean isOrphanedSurrogate(int cp) {
        if (cp >= Character.MIN_HIGH_SURROGATE && cp <= Character.MAX_HIGH_SURROGATE) {
            return true;
        }
 
        if (cp >= Character.MIN_LOW_SURROGATE && cp <= Character.MAX_LOW_SURROGATE) {
            return true;
        }
        return false;
    }
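
A sketch of how such a check might be applied, assuming the string is walked by code point: String#codePointAt returns a supplementary code point for a well-formed pair, so a value in the surrogate range can only come from an unpaired surrogate. This is an illustration only, not the driver's encoding method; SurrogateSanitizer and replaceOrphanedSurrogates are hypothetical names, and U+FFFD is assumed as the replacement.

```java
final class SurrogateSanitizer {

    private static final char REPLACEMENT_CHARACTER = '\uFFFD';

    // Same range test as above: any code point in the surrogate block.
    static boolean isOrphanedSurrogate(int cp) {
        return cp >= Character.MIN_HIGH_SURROGATE && cp <= Character.MAX_LOW_SURROGATE;
    }

    // Returns s unchanged when it is well formed; otherwise returns a copy
    // with every unpaired surrogate replaced by U+FFFD.
    static String replaceOrphanedSurrogates(String s) {
        StringBuilder sb = null; // allocated only if a replacement is needed
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (isOrphanedSurrogate(cp)) {
                if (sb == null) {
                    // Copy the well-formed prefix seen so far.
                    sb = new StringBuilder(s.length()).append(s, 0, i);
                }
                sb.append(REPLACEMENT_CHARACTER);
            } else if (sb != null) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // 2 for a proper pair, 1 otherwise
        }
        return sb == null ? s : sb.toString();
    }
}
```

This version builds a new String only when a replacement is needed, so it does not address the allocation concern for the driver's own encoding loop.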

Comment by Jeffrey Yemin [ 04/Jan/22 ]

The fix would be in this method.

Perhaps something like:

if (Character.getType(c) == Character.SURROGATE) {
   c = REPLACEMENT_CHARACTER;
}

but that's not right, since it evaluates to true for a proper surrogate pair as well.

By the way: the reason that method doesn't just call String#getBytes is to avoid an unnecessary array allocation and copy.

Comment by Esha Bhargava [ 21/Dec/21 ]

smachnowski@ixl.com Thank you for filing the issue! We'll look into it and get back to you soon.
