Java Driver / JAVA-4431

Driver allows inserting invalid UTF-8 strings

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Minor - P4
    • Affects Version/s: None
    • Component/s: BSON

      Summary

      It is possible to create a String in Java containing char sequences (for example, an unpaired surrogate) that, when encoded by the MongoDB Java Driver, are stored in a MongoDB string field as invalid UTF-8.

      The BSON spec defines the string type as UTF-8, which I think implies valid UTF-8 (but I could be wrong). When invalid UTF-8 is stored in a string field, a text index cannot be created on that field because MongoDB throws an exception. The affected fields are then also difficult to find and fix.

      This behavior differs from the mongo shell and other drivers, which validate the UTF-8 before persistence and substitute replacement characters such as '?' or U+FFFD so that the DB contains only valid UTF-8 strings.

      Please provide the version of the driver. If applicable, please provide the MongoDB server version and topology (standalone, replica set, or sharded cluster).

      Reproducible with MongoDB Java Driver versions: 4.3.x, 4.4.0

      MongoDB Server: 4.4.10 (the replica set crashes when trying to create a text index on the field)

      MongoDB Server: 5.0.4 (fails with an exception when trying to create a text index on the field)

      How to Reproduce

      Create a String in Java from random bytes or by truncating a String that contains emojis. Insert it into a collection. Then try to create a text index on that field.


      import java.nio.charset.StandardCharsets;
      import java.util.List;
      import org.bson.Document;
      import com.mongodb.MongoClientSettings;
      import com.mongodb.MongoCommandException;
      import com.mongodb.MongoCredential;
      import com.mongodb.ServerAddress;
      import com.mongodb.client.MongoClient;
      import com.mongodb.client.MongoClients;
      import com.mongodb.client.MongoCollection;
      public class CreateInvalidUTF8 {
          
          public static void main(String[] args) {
              String fname = "Firstname";
              String lname = "🐎🐎🐎🐎";
              
              MongoClient mongoCli = MongoClients.create(MongoClientSettings.builder()
                      .applyToClusterSettings(builder ->
                              builder.hosts(List.of(new ServerAddress(args[0]))))
                      .credential(MongoCredential.createCredential(args[1], "admin", args[2].toCharArray()))
                      .build());
              MongoCollection<Document> coll = mongoCli.getDatabase("test_utf8").getCollection("foo");
              MongoCollection<Document> coll2 = mongoCli.getDatabase("test_utf8").getCollection("bar");
              
              Document d = new Document();
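              // lname.charAt(0) returns only the high surrogate of the first emoji,
              // so the resulting String ends in an unpaired surrogate.
              // This is a common mistake in Java code that takes user input,
              // but it could also be a String created from any byte array
              // containing random sequences that cannot be encoded.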
              d.put("name", fname + " " + lname.charAt(0));
              coll.insertOne(d);
              
              // fails with MongoCommandException "text contains invalid UTF-8"
              try {
                  coll.createIndex(new Document("name", "text"));
              } catch (MongoCommandException e) {
                  e.printStackTrace();
              }
              
              Document d2 = new Document();
              // String.getBytes(UTF_8) substitutes a replacement character for
              // anything it cannot encode, so the round-tripped String is valid UTF-8
              d2.put("name", new String((fname + " " + lname.charAt(0)).getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));
              coll2.insertOne(d2);
              
              // succeeds
              coll2.createIndex(new Document("name", "text"));
              
          }
          
      }
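
A minimal sketch of a client-side guard, assuming the application wants to detect or repair such strings before handing them to the driver. The class `Utf8Guard` and its methods `isEncodable` and `sanitize` are hypothetical names for illustration and are not part of the driver; only JDK classes are used. `sanitize` applies the same `getBytes` round trip as `d2` in the reproduction above.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Guard {

    // True when s can be encoded as well-formed UTF-8,
    // i.e. it contains no unpaired surrogate.
    public static boolean isEncodable(String s) {
        // A fresh encoder per call, because CharsetEncoder is not thread-safe.
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    // Round-trips s through String.getBytes, which substitutes the charset's
    // replacement bytes for anything it cannot encode.
    public static String sanitize(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String horse = "\uD83D\uDC0E";                   // U+1F40E as a surrogate pair
        String broken = "Firstname " + horse.charAt(0);  // ends in an unpaired high surrogate

        System.out.println(isEncodable("Firstname " + horse)); // true
        System.out.println(isEncodable(broken));               // false
        System.out.println(isEncodable(sanitize(broken)));     // true
    }
}
```

Checking with `isEncodable` lets the caller reject or log the input, whereas `sanitize` silently alters it; which is appropriate depends on whether losing the original character is acceptable.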
      
      


      Additional Background

      The same test performed via the mongo shell succeeds and stores only valid UTF-8 in the DB string fields.

      db.baz.insertOne({name:('Firstname ' + '🐎🐎🐎🐎'.substring(0,1))})
      db.baz.createIndex({name:'text'})
      

            Assignee: Unassigned
            Reporter: Stephen Machnowski (smachnowski@ixl.com)
            Votes: 0
            Watchers: 4
