- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 1.5.1
- Component/s: JSON & ExtJSON
RFC 8259 section 7 requires special handling of surrogate pairs like "\uD834\uDd1e":
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a 12-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
`UnmarshalExtJSON` does not properly decode surrogate pairs. Instead, it converts each half of the pair to a Unicode replacement character (U+FFFD).
Demo program:
package main

import (
	"encoding/hex"
	"fmt"

	"go.mongodb.org/mongo-driver/bson"
)

func main() {
	// Extended JSON input with the G clef (U+1D11E) escaped as a surrogate pair.
	str := `{"a":"\uD834\uDd1e"}`
	// The same document built directly from a Go string.
	doc := bson.D{{"a", "\U0001D11E"}}

	var buf bson.Raw
	err := bson.UnmarshalExtJSON([]byte(str), true, &buf)
	if err != nil {
		panic(err)
	}
	fmt.Println("Unmarshaled from JSON: " + hex.EncodeToString(buf))

	doc2, err := bson.Marshal(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println("Marshaled from bson.D: " + hex.EncodeToString(doc2))
}
Output:
Unmarshaled from JSON: 1300000002610007000000efbfbdefbfbd0000
Marshaled from bson.D: 1100000002610005000000f09d849e0000
Treatment of ill-formed surrogate pairs (e.g. a lone surrogate) is often implementation-defined. You can find cases to consider in this corpus: https://github.com/nst/JSONTestSuite
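To illustrate the kinds of ill-formed input involved, the sketch below shows one possible policy: replacing each unpaired surrogate with U+FFFD, which is what Go's `utf16.Decode` does. This is an assumption chosen for illustration, not a statement of how the driver should or does behave; `decodeEscapes` is a hypothetical helper that takes the 16-bit values of consecutive `\uXXXX` escapes.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// decodeEscapes (hypothetical helper) turns the 16-bit values of consecutive
// \uXXXX escapes into a Go string. utf16.Decode combines valid surrogate
// pairs and substitutes U+FFFD for any surrogate that is not part of one.
func decodeEscapes(units []uint16) string {
	return string(utf16.Decode(units))
}

func main() {
	cases := [][]uint16{
		{0xD834, 0xDD1E}, // well-formed pair
		{0xD834},         // lone high surrogate
		{0xDD1E, 0xD834}, // low surrogate first, then high
		{0xD834, 0x0041}, // high surrogate followed by 'A'
	}
	for _, c := range cases {
		fmt.Printf("%04X -> %+q\n", c, decodeEscapes(c))
	}
}
```

Other policies, such as rejecting the input with an error, are equally defensible; the test suite linked above catalogs how different parsers behave.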