Details
Type: Bug
Resolution: Won't Fix
Priority: Minor - P4
Description
Summary
It is possible to create a String in Java containing char sequences (for example, an unpaired surrogate) that, when encoded by the Mongo Java Driver, are stored in a MongoDB string field as invalid UTF-8.
The BSON spec defines the string type as UTF-8, which I think implies valid UTF-8 (but I could be wrong). When invalid UTF-8 is stored in a string field, a text index cannot be created on that field because MongoDB throws an exception. The affected fields are then also difficult to find and fix.
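One way to spot such values, sketched here under the assumption that the raw bytes of a string field are already in hand (how they are read from the server is outside this sketch), is to run them through a strict UTF-8 decoder. The class and method names are hypothetical; only the JDK `CharsetDecoder` API is used.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the MongoDB Java Driver.
class Utf8ByteCheck {

    /** Returns true only if the whole array decodes as well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of silently substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

The default `new String(bytes, UTF_8)` path would hide the problem, because it replaces malformed input rather than reporting it.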
This behavior differs from the mongo shell and other drivers, which validate the UTF-8 before persistence and use replacement characters such as '?' or U+FFFD to ensure the DB only contains valid UTF-8 strings.
Please provide the version of the driver. If applicable, please provide the MongoDB server version and topology (standalone, replica set, or sharded cluster).
Replicable with Mongo Java Drivers: 4.3.x, 4.4.0
MongoDB Server: 4.4.10 (crashes replica set trying to create text index on the field)
MongoDB Server: 5.0.4 (fails with exception trying to create text index on the field)
How to Reproduce
Create a String in Java from random bytes or by truncating a String that contains emoji. Insert it into a collection. Try to create a text index on the field.
import java.nio.charset.StandardCharsets;
import java.util.List;
import org.bson.Document;
import com.mongodb.MongoClientSettings;
import com.mongodb.MongoCommandException;
import com.mongodb.MongoCredential;
import com.mongodb.ServerAddress;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class CreateInvalidUTF8 {

    // args: host, user name, password
    public static void main(String[] args) {
        String fname = "Firstname";
        // Four emoji (U+1F600 used here; any emoji outside the BMP behaves the same).
        // Each one is a surrogate pair, i.e. two chars, in a Java String.
        String lname = "\uD83D\uDE00\uD83D\uDE00\uD83D\uDE00\uD83D\uDE00";

        MongoClient mongoCli = MongoClients.create(MongoClientSettings.builder()
                .applyToClusterSettings(builder ->
                        builder.hosts(List.of(new ServerAddress(args[0]))))
                .credential(MongoCredential.createCredential(args[1], "admin", args[2].toCharArray()))
                .build());
        MongoCollection<Document> coll = mongoCli.getDatabase("test_utf8").getCollection("foo");
        MongoCollection<Document> coll2 = mongoCli.getDatabase("test_utf8").getCollection("bar");

        Document d = new Document();
        // This is a common mistake in Java code that takes user input:
        // charAt(0) returns only the high surrogate of the first emoji.
        // But it could also be a String created from any byte array
        // containing random sequences that cannot be encoded
        d.put("name", fname + " " + lname.charAt(0));
        coll.insertOne(d);

        // fails with MongoCommandException "text contains invalid UTF-8"
        try {
            coll.createIndex(new Document("name", "text"));
        } catch (MongoCommandException e) {
            e.printStackTrace();
        }

        Document d2 = new Document();
        // String.getBytes(UTF_8) validates UTF-8
        // and substitutes valid replacement character so the UTF-8 is valid
        d2.put("name", new String((fname + " " + lname.charAt(0)).getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));
        coll2.insertOne(d2);

        // succeeds
        coll2.createIndex(new Document("name", "text"));
    }
}
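The truncation mistake in the reproduction can be avoided on the application side by cutting on code point boundaries instead of char indexes. The following is a minimal sketch using only JDK methods (`codePointCount`, `offsetByCodePoints`); the class and method names are hypothetical and it is not something the driver provides.

```java
// Hypothetical helper, not part of the MongoDB Java Driver.
class SafeTruncate {

    /** Keeps at most maxCodePoints code points and never splits a surrogate pair. */
    static String truncate(String s, int maxCodePoints) {
        if (maxCodePoints < 0) {
            throw new IllegalArgumentException("maxCodePoints must be >= 0");
        }
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s; // already short enough
        }
        // offsetByCodePoints returns the char index just past the last kept code point.
        return s.substring(0, s.offsetByCodePoints(0, maxCodePoints));
    }
}
```

With this helper, `SafeTruncate.truncate(lname, 1)` would keep the whole first emoji (two chars) where `lname.charAt(0)` keeps only its high surrogate. It does not repair a String that already contains an unpaired surrogate.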
Additional Background
The same test run via the mongo shell succeeds and stores only valid UTF-8 in the DB string fields.
db.baz.insertOne({name: ('Firstname ' + '\uD83D\uDE00\uD83D\uDE00\uD83D\uDE00\uD83D\uDE00'.substring(0, 1))})
db.baz.createIndex({name: 'text'})
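Until the driver validates strings itself, comparable behavior can be approximated in application code before the insert. The sketch below is an assumption about how one might do that, not the shell's or any driver's actual implementation: it detects a String that cannot be encoded as UTF-8 and replaces each unpaired surrogate with U+FFFD. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the MongoDB Java Driver.
class Utf8Guard {

    /** Returns false if s contains an unpaired surrogate and so has no valid UTF-8 form. */
    static boolean isEncodableAsUtf8(String s) {
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    /** Replaces every unpaired surrogate with U+FFFD and keeps valid pairs intact. */
    static String replaceUnpairedSurrogates(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Well-formed pair: copy both chars and skip the low surrogate.
                sb.append(c).append(s.charAt(++i));
            } else if (Character.isSurrogate(c)) {
                sb.append('\uFFFD'); // lone high or lone low surrogate
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

For the reproduction above, `d.put("name", Utf8Guard.replaceUnpairedSurrogates(fname + " " + lname.charAt(0)))` would store "Firstname " followed by U+FFFD. The `getBytes(UTF_8)` round trip used for `d2` is shorter but substitutes '?' instead.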
Issue Links
- is related to
SERVER-62871 [4.4] Improve handling of text index creation in the presence of invalid UTF-8 (Closed)
SERVER-62348 Text index creation fails with error "text contains invalid UTF-8" (Closed)