[SERVER-62348] Text index creation fails with error "text contains invalid UTF-8" Created: 04/Jan/22  Updated: 27/Oct/23  Resolved: 14/Jan/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Minor - P4
Reporter: Jeffrey Yemin
Assignee: Backlog - Query Execution
Resolution: Works as Designed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Related
related to SERVER-62871 [4.4] Improve handling of text index ... Closed
related to JAVA-4431 Driver allows inserting invalid UTF-8... Closed
Assigned Teams:
Query Execution
Participants:

 Description   

If a collection contains a document with a string field whose value is not valid UTF-8, creating a text index on that field fails with the error "text contains invalid UTF-8". Creating a normal index succeeds.

(In the scenario I tested, the string is invalid because it contains only the high surrogate of a surrogate pair.)
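The lone-surrogate case can be illustrated outside the server. The following is a minimal Python sketch, not the driver's or the server's actual code: it assumes a lenient encoder that writes the unpaired high surrogate as a raw three-byte sequence (Python's "surrogatepass" error handler is used here to imitate that), and the helper name is_valid_utf8 is hypothetical. The bytes actually produced by the driver in JAVA-4431 may differ.

```python
# A lone high surrogate (U+D83D) with no low surrogate following it.
LONE_HIGH_SURROGATE = "\ud83d"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Assumption: a lenient encoder emits the surrogate as raw bytes instead of
# rejecting it. "surrogatepass" imitates that behaviour.
raw = ("abc" + LONE_HIGH_SURROGATE).encode("utf-8", errors="surrogatepass")

# A complete surrogate pair corresponds to one supplementary code point,
# which encodes to valid four-byte UTF-8.
paired = "abc\U0001F600".encode("utf-8")

print(raw, is_valid_utf8(raw))
print(paired, is_valid_utf8(paired))
```

A strict decoder rejects the first byte string and accepts the second, which matches the distinction the text index build appears to make.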

Given that the server allows invalid UTF-8 strings to be inserted into the database, we should consider whether the server should be more resilient to the presence of invalid UTF-8 strings when creating text indices.

It would also be reasonable to close this as Works as Designed, as I imagine it's fairly rare for this to happen in practice.



 Comments   
Comment by Jennifer Peshansky (Inactive) [ 14/Jan/22 ]

jeff.yemin, while we can't automatically validate all UTF-8, perhaps we could add an option to the validate command that lets a user manually check that all documents in their collection contain valid UTF-8. We could also modify the error message to suggest that the user run validate with this new option to find the offending documents, so that they can then create a text index; alternatively, this error message could be sent from the driver. If something like this would solve the underlying problem, we can file a new ticket to clarify that request and re-triage it then.
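As a rough illustration of the kind of check suggested above, the sketch below scans documents for string values that are not valid UTF-8 and reports the field path and byte offset of the first bad byte. It is client-side Python, not the proposed validate option, and it rests on assumptions: string values are modelled as raw bytes (a real driver would normally decode them before the application sees them), and find_invalid_utf8 is a hypothetical helper name.

```python
from typing import Any, Iterator, Tuple


def find_invalid_utf8(value: Any, path: str = "") -> Iterator[Tuple[str, int]]:
    """Yield (field path, byte offset) for each bytes value that is not valid UTF-8."""
    if isinstance(value, bytes):
        try:
            value.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first undecodable byte.
            yield path, exc.start
    elif isinstance(value, dict):
        for key, child in value.items():
            yield from find_invalid_utf8(child, f"{path}.{key}" if path else str(key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from find_invalid_utf8(child, f"{path}.{index}" if path else str(index))


# Hypothetical sample data: raw string bytes as they might be stored.
docs = [
    {"_id": 1, "title": b"plain ascii"},
    {"_id": 2, "title": b"bad \xed\xa0\xbd tail", "tags": [b"ok", b"\xff"]},
]

# Map each document id to the list of offending fields found in it.
report = {doc["_id"]: list(find_invalid_utf8(doc)) for doc in docs}
print(report)
```

Documents with an empty list in the report would be safe to index under this check; the others identify which fields need to be repaired or removed first.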

Comment by Kyle Suarez [ 14/Jan/22 ]

jeff.yemin, after discussion at the triage meeting, this is the expected behavior. While the server will allow invalid UTF-8 to be inserted, the text index creation cannot meaningfully succeed because there would be documents that cannot be indexed properly.

Comment by Jeffrey Yemin [ 04/Jan/22 ]

Please see description of JAVA-4431 for more context from the original reporter.
