[CSHARP-996] ERROR Unable to translate bytes at index N from specified code page to Unicode. Created: 16/Jun/14 Updated: 04/Apr/15 Resolved: 04/Apr/15 |
|
| Status: | Closed |
| Project: | C# Driver |
| Component/s: | API, BSON |
| Affects Version/s: | 1.8.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Maxim Sidorenko | Assignee: | Robert Stam |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
MongoDB version: 2.6.1 |
||
| Description |
|
We encountered the following exception while reading from the database, and we have not been able to reproduce it since. All read/write operations against MongoDB are performed with the C# MongoDB driver 1.8.3; MongoDB version: 2.6.1. ERROR Unable to translate bytes [AB] at index 106 from specified code page to Unicode. |
| Comments |
| Comment by Robert Stam [ 16/Jun/14 ] |
|
String data is stored in MongoDB as UTF-8 strings. This exception occurs when the driver is converting a UTF-8 encoded string to Unicode (the in-memory representation used by a C# string). Somehow a document with invalid UTF-8 has been introduced to the database. If all documents were inserted using the C# driver this should not be possible, since the C# driver uses the same UTF-8 encoder to encode the stored strings as it does to decode them. The last time I encountered this issue the problematic UTF-8 data had been inserted into the database using mongoimport.

If necessary you can configure the C# driver to be more lenient in its UTF-8 decoding. You can do that by setting the ReadEncoding property in either MongoClientSettings, MongoDatabaseSettings or MongoCollectionSettings. By default the C# driver uses a strict encoder (which is why you are getting this exception when the UTF-8 data is invalid). You can configure a lenient encoder like this:

var throwOnInvalidBytes = false;

A lenient encoding will do the best it can to decode UTF-8, even if it is invalid. Any invalid UTF-8 bytes are decoded to a special Unicode character that represents an "invalid character" (the replacement character, U+FFFD).

Keep in mind that if you use a lenient encoding to read a document containing invalid UTF-8 and then save it back to the database, the stored bytes will change slightly (from the original invalid UTF-8 data to a valid UTF-8 representation of the "invalid character" Unicode code point). |
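
The snippet in the comment above survives only as its first line, so the following is a minimal sketch of what a lenient ReadEncoding setup might look like. It is not the original code from the comment. It assumes the standard System.Text.UTF8Encoding constructor and the ReadEncoding property on MongoClientSettings that the comment names. The class name, the sample bytes and the reliance on the default server address are illustrative assumptions.

```csharp
using System;
using System.Text;
using MongoDB.Driver;

// Hypothetical example class; the name is not part of the driver.
public static class LenientReadEncodingExample
{
    public static void Main()
    {
        // Illustrative input: 'a', a lone 0xAB byte (not valid UTF-8 on its own), 'b'.
        var bytes = new byte[] { 0x61, 0xAB, 0x62 };

        // A strict encoding throws when it meets bytes it cannot decode.
        var strictEncoding = new UTF8Encoding(false, true);
        try
        {
            strictEncoding.GetString(bytes);
        }
        catch (DecoderFallbackException ex)
        {
            Console.WriteLine("Strict decoding failed: " + ex.Message);
        }

        // A lenient encoding substitutes a replacement character instead of throwing.
        var throwOnInvalidBytes = false;
        var lenientEncoding = new UTF8Encoding(false, throwOnInvalidBytes);
        var decoded = lenientEncoding.GetString(bytes);
        Console.WriteLine("Contains U+FFFD: " + (decoded.IndexOf('\uFFFD') >= 0));

        // Assumption: apply the lenient encoding at the client level so that
        // reads through this client use it. No server address is set here,
        // so the settings' default address applies.
        var settings = new MongoClientSettings { ReadEncoding = lenientEncoding };
        var client = new MongoClient(settings);
    }
}
```

Setting ReadEncoding on MongoDatabaseSettings or MongoCollectionSettings instead would limit the lenient behaviour to one database or collection. That may be preferable when only a known subset of the data was imported by another tool.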