[SERVER-62348] Text index creation fails with error "text contains invalid UTF-8" Created: 04/Jan/22 Updated: 27/Oct/23 Resolved: 14/Jan/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor - P4 |
| Reporter: | Jeffrey Yemin | Assignee: | Backlog - Query Execution |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Query Execution |
| Participants: | |
| Description |
|
If a document is added to a collection with a field whose string value is not valid UTF-8, creating a text index on that field fails with the error "text contains invalid UTF-8". Creating a normal index succeeds. (In the scenario I tested, the string contains only the high surrogate of a surrogate pair.) Given that the server allows invalid UTF-8 strings to be inserted into the database, we should consider whether the server should be more resilient to invalid UTF-8 strings when creating text indexes. It would also be reasonable to close this as Works as Designed, as I imagine this is fairly rare in practice. |
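To illustrate the kind of input described above, here is a minimal Python sketch showing why a lone high surrogate is not valid UTF-8. It is not server code: `is_valid_utf8` is a hypothetical helper, U+D83D is an arbitrary example of a high surrogate, and the ticket does not say how the reporter's bytes were actually produced.

```python
# A high surrogate (U+D800..U+DBFF) is only meaningful as the first half of a
# UTF-16 surrogate pair. On its own it has no valid UTF-8 encoding.
LONE_HIGH_SURROGATE = "\ud83d"  # arbitrary example, assumed for illustration


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")  # strict mode rejects encoded surrogates
    except UnicodeDecodeError:
        return False
    return True


# "surrogatepass" makes Python emit the three-byte sequence that a lenient
# encoder would write for the lone surrogate; a strict encoder would refuse.
bad_bytes = ("caf\u00e9 " + LONE_HIGH_SURROGATE).encode(
    "utf-8", errors="surrogatepass"
)

# The same text followed by a complete supplementary character encodes cleanly.
good_bytes = "caf\u00e9 \U0001F600".encode("utf-8")
```

A strict decoder rejects `bad_bytes` and accepts `good_bytes`, which mirrors the distinction the error message draws.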
| Comments |
| Comment by Jennifer Peshansky (Inactive) [ 14/Jan/22 ] |
|
jeff.yemin, while we can't automatically validate all UTF-8, perhaps we could add an option to the validate command that would allow a user to manually check that all documents in their collection have valid UTF-8. We could also modify the error message to suggest the user run validate with this new option in order to find the documents that are at fault, so that they can create a text index; or perhaps this error message can be sent from the driver. If something like this would solve the underlying problem, then we can file a new ticket to clarify that request, and re-triage it then. |
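The manual check suggested in this comment could look roughly like the following client-side sketch. It is not the proposed `validate` option, which does not exist in this ticket: `find_invalid_utf8_fields` is a hypothetical helper, and it assumes the documents have already been read in a form where string payloads are available as raw `bytes` inside plain dicts and lists.

```python
from typing import Any, Iterable, Iterator, Tuple


def _invalid_paths(value: Any, path: str) -> Iterator[str]:
    """Yield dotted paths of raw string payloads that are not strict UTF-8."""
    if isinstance(value, bytes):
        try:
            value.decode("utf-8")
        except UnicodeDecodeError:
            yield path
    elif isinstance(value, dict):
        for key, child in value.items():
            yield from _invalid_paths(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            child_path = f"{path}.{index}" if path else str(index)
            yield from _invalid_paths(child, child_path)
    # Other value types (numbers, None, ...) carry no string payload to check.


def find_invalid_utf8_fields(
    docs: Iterable[dict],
) -> Iterator[Tuple[Any, str]]:
    """Yield (_id, field path) for every field holding invalid UTF-8."""
    for doc in docs:
        for path in _invalid_paths(doc, ""):
            yield doc.get("_id"), path
```

Run over a collection's documents, this would report each offending `_id` and field path, so those documents can be repaired or excluded before retrying the text index build.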
| Comment by Kyle Suarez [ 14/Jan/22 ] |
|
jeff.yemin, after discussion at the triage meeting, this is the expected behavior. While the server will allow invalid UTF-8 to be inserted, the text index creation cannot meaningfully succeed because there would be documents that cannot be indexed properly. |
| Comment by Jeffrey Yemin [ 04/Jan/22 ] |
|
Please see description of |