- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 1.5.1
- Component/s: JSON & ExtJSON
RFC 8259 section 7 requires special handling of surrogate pairs like "\uD834\uDd1e":
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a 12-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
`UnmarshalExtJSON` does not properly decode surrogate pairs. Instead, it converts each half of the pair to a Unicode replacement character (U+FFFD).
Demo program:
package main

import (
	"encoding/hex"
	"fmt"

	"go.mongodb.org/mongo-driver/bson"
)

func main() {
	// Extended JSON input containing an escaped surrogate pair for U+1D11E.
	str := `{"a":"\uD834\uDd1e"}`
	// The same document built directly, with the code point written as one escape.
	doc := bson.D{{"a", "\U0001D11E"}}

	var buf bson.Raw
	err := bson.UnmarshalExtJSON([]byte(str), true, &buf)
	if err != nil {
		panic(err)
	}
	fmt.Println("Unmarshaled from JSON: " + hex.EncodeToString(buf))

	doc2, err := bson.Marshal(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println("Marshaled from bson.D: " + hex.EncodeToString(doc2))
}
Output:
Unmarshaled from JSON: 1300000002610007000000efbfbdefbfbd0000
Marshaled from bson.D: 1100000002610005000000f09d849e0000
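To make the difference between the two hex dumps easier to see, the string payload can be pulled out of each one. The sketch below assumes the simple layout of these particular documents (a single string element) and uses a hypothetical helper, `firstStringValue`; it is not a general BSON parser.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/hex"
	"errors"
	"fmt"
)

// firstStringValue is a hypothetical helper for this report only. It decodes
// a hex-encoded BSON document whose first element is a string and returns
// that string's value.
func firstStringValue(docHex string) (string, error) {
	raw, err := hex.DecodeString(docHex)
	if err != nil {
		return "", err
	}
	// Layout: int32 document length, type byte (0x02 = string), key as a
	// NUL-terminated cstring, int32 value length (counting the trailing
	// NUL), value bytes, NUL.
	if len(raw) < 5 || raw[4] != 0x02 {
		return "", errors.New("first element is not a BSON string")
	}
	keyEnd := bytes.IndexByte(raw[5:], 0x00)
	if keyEnd < 0 {
		return "", errors.New("unterminated key")
	}
	pos := 5 + keyEnd + 1
	if len(raw) < pos+4 {
		return "", errors.New("truncated string length")
	}
	n := int(binary.LittleEndian.Uint32(raw[pos:]))
	pos += 4
	if n < 1 || len(raw) < pos+n {
		return "", errors.New("truncated string value")
	}
	return string(raw[pos : pos+n-1]), nil
}

func main() {
	for _, h := range []string{
		"1300000002610007000000efbfbdefbfbd0000", // from UnmarshalExtJSON
		"1100000002610005000000f09d849e0000",     // from bson.Marshal
	} {
		s, err := firstStringValue(h)
		if err != nil {
			panic(err)
		}
		// %+q escapes non-ASCII runes so the code points are visible.
		fmt.Printf("%+q\n", s)
	}
}
```

The first payload is `ef bf bd` twice, which is the UTF-8 encoding of U+FFFD repeated; the second is `f0 9d 84 9e`, the UTF-8 encoding of U+1D11E.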
Treatment of ill-formed surrogate pairs (e.g. a lone surrogate) is often implementation-defined. You can find cases to consider in this corpus: https://github.com/nst/JSONTestSuite
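As one point of comparison, and not a statement of what the driver should do, Go's standard `encoding/json` package combines a well-formed pair into a single code point and replaces a lone surrogate with U+FFFD rather than returning an error. A minimal sketch, with `decodeJSONString` as a hypothetical wrapper name:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeJSONString is a hypothetical wrapper: it decodes a JSON string
// literal (including the surrounding quotes) using encoding/json.
func decodeJSONString(lit string) (string, error) {
	var s string
	err := json.Unmarshal([]byte(lit), &s)
	return s, err
}

func main() {
	// Well-formed pair: both halves are combined into U+1D11E.
	pair, err := decodeJSONString(`"\uD834\uDd1e"`)
	if err != nil {
		panic(err)
	}
	fmt.Println(pair == "\U0001D11E") // prints true

	// Lone high surrogate: encoding/json substitutes U+FFFD, no error.
	lone, err := decodeJSONString(`"\uD834"`)
	if err != nil {
		panic(err)
	}
	fmt.Println(lone == "\uFFFD") // prints true
}
```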