[GODRIVER-1947] UnmarshalExtJSON doesn't handle escaped surrogate pairs Created: 04/Apr/21  Updated: 28/Oct/23  Resolved: 04/May/21

Status: Closed
Project: Go Driver
Component/s: JSON & ExtJSON
Affects Version/s: 1.5.1
Fix Version/s: 1.5.2

Type: Bug
Priority: Major - P3
Reporter: David Golden
Assignee: Matt Dale
Resolution: Fixed
Votes: 0
Labels: matt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

RFC 8259 section 7 requires special handling of surrogate pairs like "\uD834\uDd1e":

To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a 12-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".

`UnmarshalExtJSON` does not properly decode surrogate pairs. Instead, it converts each escaped surrogate to the Unicode replacement character (U+FFFD).

Demo program:

package main

import (
	"encoding/hex"
	"fmt"

	"go.mongodb.org/mongo-driver/bson"
)

func main() {
	// Extended JSON with U+1D11E written as an escaped surrogate pair.
	str := `{"a":"\uD834\uDd1e"}`
	// The same document built directly, with the code point in a Go string literal.
	doc := bson.D{{"a", "\U0001D11E"}}

	var buf bson.Raw
	err := bson.UnmarshalExtJSON([]byte(str), true, &buf)
	if err != nil {
		panic(err)
	}
	fmt.Println("Unmarshaled from JSON: " + hex.EncodeToString(buf))

	doc2, err := bson.Marshal(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println("Marshaled from bson.D: " + hex.EncodeToString(doc2))
}

Output:

Unmarshaled from JSON: 1300000002610007000000efbfbdefbfbd0000
Marshaled from bson.D: 1100000002610005000000f09d849e0000
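
To make the difference between the two dumps easier to see, here is a small sketch that decodes just the string-value bytes from each line (the bytes after the 4-byte string length and before the trailing null terminators). The helper name `runesFromHex` is hypothetical. The first payload decodes to two U+FFFD runes and the second to the single expected code point.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// runesFromHex (hypothetical helper) decodes a hex string into bytes and
// interprets those bytes as UTF-8, returning the resulting code points.
func runesFromHex(h string) []rune {
	b, err := hex.DecodeString(h)
	if err != nil {
		panic(err)
	}
	return []rune(string(b))
}

func main() {
	// String payload from the "Unmarshaled from JSON" line.
	fmt.Printf("%U\n", runesFromHex("efbfbdefbfbd"))
	// String payload from the "Marshaled from bson.D" line.
	fmt.Printf("%U\n", runesFromHex("f09d849e"))
}
```

The document lengths agree with this reading: 0x13 (19 bytes) for the two 3-byte replacement characters versus 0x11 (17 bytes) for the one 4-byte character.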

Treatment of ill-formed surrogate pairs (e.g., a lone surrogate) is often implementation-defined. Cases to consider can be found in this corpus: https://github.com/nst/JSONTestSuite
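
One possible policy for ill-formed input, sketched below under stated assumptions, is to replace each unpaired surrogate with U+FFFD and carry on decoding. This is not the driver's scanner: `readEscape` and `decodeUnicodeEscapes` are hypothetical helpers that only understand `\uXXXX` escapes and ignore every other JSON escape, including an escaped backslash.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
	"unicode/utf16"
)

// readEscape (hypothetical) parses a \uXXXX escape at the start of s.
func readEscape(s string) (rune, bool) {
	if len(s) < 6 || s[0] != '\\' || s[1] != 'u' {
		return 0, false
	}
	v, err := strconv.ParseUint(s[2:6], 16, 16)
	if err != nil {
		return 0, false
	}
	return rune(v), true
}

// decodeUnicodeEscapes (hypothetical) expands \uXXXX escapes in s. A valid
// surrogate pair becomes one code point; an unpaired surrogate becomes U+FFFD.
func decodeUnicodeEscapes(s string) string {
	var out strings.Builder
	for len(s) > 0 {
		r, ok := readEscape(s)
		if !ok {
			// Not an escape: copy the byte through unchanged.
			out.WriteByte(s[0])
			s = s[1:]
			continue
		}
		s = s[6:]
		if !utf16.IsSurrogate(r) {
			out.WriteRune(r)
			continue
		}
		// Try to pair the surrogate with an immediately following escape.
		if lo, ok := readEscape(s); ok {
			if c := utf16.DecodeRune(r, lo); c != unicode.ReplacementChar {
				out.WriteRune(c)
				s = s[6:]
				continue
			}
		}
		// Lone or mismatched surrogate: emit U+FFFD and leave any
		// following escape to be decoded on its own.
		out.WriteRune(unicode.ReplacementChar)
	}
	return out.String()
}

func main() {
	for _, in := range []string{`\uD834\uDd1e`, `\uD834`, `\uD834\u0041`, `\uDd1e\uD834`} {
		fmt.Printf("%-14s -> %+q\n", in, decodeUnicodeEscapes(in))
	}
}
```

Other policies, such as rejecting the document with an error, are equally defensible. The test corpus linked above lists such inputs among those that parsers may either accept or reject.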



 Comments   
Comment by Githook User [ 04/May/21 ]

Author:

{'name': 'Matt Dale', 'email': '9760375+matthewdale@users.noreply.github.com', 'username': 'matthewdale'}

Message: GODRIVER-1947 Unmarshal unicode surrogate pairs correctly in UnmarshalExtJSON. (#649)

  • GODRIVER-1947 Unmarshal unicode surrogate pairs correctly in UnmarshalExtJSON.
  • Correct err handling in jsonScanner.ScanString and remove unused field in UnmarshalExtJSON test.
  • Explicitly write unicode.ReplacementChar for invalid surrogate and simplify test types.
  • Add tests for high surrogate followed by non-Unicode escape sequence and 4-byte UTF-8 extJSON marshaling.
Comment by Githook User [ 03/May/21 ]

Author:

{'name': 'Matt Dale', 'email': '9760375+matthewdale@users.noreply.github.com', 'username': 'matthewdale'}

Message: GODRIVER-1947 Unmarshal unicode surrogate pairs correctly in UnmarshalExtJSON. (#649)

  • GODRIVER-1947 Unmarshal unicode surrogate pairs correctly in UnmarshalExtJSON.
  • Correct err handling in jsonScanner.ScanString and remove unused field in UnmarshalExtJSON test.
  • Explicitly write unicode.ReplacementChar for invalid surrogate and simplify test types.
  • Add tests for high surrogate followed by non-Unicode escape sequence and 4-byte UTF-8 extJSON marshaling.
Comment by Matt Dale [ 26/Apr/21 ]

PR: https://github.com/mongodb/mongo-go-driver/pull/649
