Details
Bug
Resolution: Fixed
Major - P3
1.5.1
Description
RFC 8259 section 7 requires special handling of surrogate pairs like "\uD834\uDd1e":
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a 12-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
`UnmarshalExtJSON` does not decode surrogate pairs correctly. Instead, it converts each half of the pair to the Unicode replacement character (U+FFFD).
Demo program:
package main

import (
	"encoding/hex"
	"fmt"

	"go.mongodb.org/mongo-driver/bson"
)

func main() {
	// Extended JSON input containing the surrogate pair for U+1D11E.
	str := `{"a":"\uD834\uDd1e"}`
	// The same document built directly, for comparison.
	doc := bson.D{{"a", "\U0001D11E"}}

	var buf bson.Raw
	err := bson.UnmarshalExtJSON([]byte(str), true, &buf)
	if err != nil {
		panic(err)
	}
	fmt.Println("Unmarshaled from JSON: " + hex.EncodeToString(buf))

	doc2, err := bson.Marshal(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println("Marshaled from bson.D: " + hex.EncodeToString(doc2))
}
Output:
Unmarshaled from JSON: 1300000002610007000000efbfbdefbfbd0000
Marshaled from bson.D: 1100000002610005000000f09d849e0000
Treatment of ill-formed surrogate pairs (e.g. only one half of a pair) is often implementation-defined. You can find cases to consider in this corpus: https://github.com/nst/JSONTestSuite
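As one point of comparison, Go's standard `encoding/json` package decodes well-formed pairs and substitutes U+FFFD for a lone surrogate. The sketch below shows that behavior. `decodeJSONString` is a hypothetical wrapper introduced here for illustration, and this is only one possible policy, not necessarily the one the driver should adopt.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeJSONString is a hypothetical wrapper that decodes a single JSON
// string literal using the standard library.
func decodeJSONString(lit string) (string, error) {
	var s string
	err := json.Unmarshal([]byte(lit), &s)
	return s, err
}

func main() {
	literals := []string{
		`"\uD834\uDd1e"`, // well-formed surrogate pair
		`"\uD834"`,       // lone high surrogate
		`"\uDd1e"`,       // lone low surrogate
	}
	for _, lit := range literals {
		s, err := decodeJSONString(lit)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		// %+q prints non-ASCII runes as escapes, so the result is easy to inspect.
		fmt.Printf("%s -> %+q\n", lit, s)
	}
}
```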