Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-60298

Explain can include ICU collation keys encoded as strings, which can be invalid UTF-8



    • Type: Bug
    • Status: In Progress
    • Priority: Major - P3
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: Backlog
    • Component/s: None
    • Labels:
    • Operating System:
    • Sprint:
      QE 2021-11-01, QE 2021-11-15, QE 2021-11-29, QE 2021-12-13, QE 2021-12-27, QE 2022-01-10, QE 2022-01-24
    • Case:


      MongoDB uses the International Components for Unicode (ICU) library to implement collation. A collation may be associated with an index. In this case, any string values in the index are not stored raw, but rather using a binary encoding from ICU. These so-called "collation keys" are binary data which can be compared with one other using a simple binary comparison. In other words, the encoding of the collation key captures any language-specific sorting semantics. By storing these keys in the index, we ensure that the index is sorted according to language-specific sorting rules.

      Unfortunately, these collation keys internally use BSON type 2, the type for UTF-8 strings, rather than type 5 (BinData). The ICU collation keys are also not guaranteed to be valid UTF-8, and they are routinely invalid when interpreted as UTF-8 in practice. This is all fine if the ICU collation keys stored inside BSON strings are only manipulated internally, however we have seen various scenarios in the past where we accidentally transmit them over the wire or write them to a log file: see related tickets SERVER-46810, SERVER-26050, and SERVER-50454.

      This ticket tracks a similar scenario involving explain. Here is a simple example repro:

      MongoDB Enterprise > db.c.drop()
      MongoDB Enterprise > db.c.createIndex({a: 1}, {collation: {locale: "es"}})
      MongoDB Enterprise > db.c.find({a: "A"}).collation({locale: "es"}).explain()
      	"explainVersion" : "1",
      	"queryPlanner" : {
      		"namespace" : "test.c",
      		"indexFilterSet" : false,
      		"parsedQuery" : {
      			"a" : {
      				"$eq" : "A"
      		"collation" : {
      			"locale" : "es",
      			"caseLevel" : false,
      			"caseFirst" : "off",
      			"strength" : 3,
      			"numericOrdering" : false,
      			"alternate" : "non-ignorable",
      			"maxVariable" : "punct",
      			"normalization" : false,
      			"backwards" : false,
      			"version" : "57.1"
      		"queryHash" : "B44D1882",
      		"planCacheKey" : "251DE71F",
      		"maxIndexedOrSolutionsReached" : false,
      		"maxIndexedAndSolutionsReached" : false,
      		"maxScansToExplodeReached" : false,
      		"winningPlan" : {
      			"stage" : "FETCH",
      			"inputStage" : {
      				"stage" : "IXSCAN",
      				"keyPattern" : {
      					"a" : 1
      				"indexName" : "a_1",
      				"collation" : {
      					"locale" : "es",
      					"caseLevel" : false,
      					"caseFirst" : "off",
      					"strength" : 3,
      					"numericOrdering" : false,
      					"alternate" : "non-ignorable",
      					"maxVariable" : "punct",
      					"normalization" : false,
      					"backwards" : false,
      					"version" : "57.1"
      				"isMultiKey" : false,
      				"multiKeyPaths" : {
      					"a" : [ ]
      				"isUnique" : false,
      				"isSparse" : false,
      				"isPartial" : false,
      				"indexVersion" : 2,
      				"direction" : "forward",
      				"indexBounds" : {
      					"a" : [
      						"[\")\u0001\u0005\u0001�\", \")\u0001\u0005\u0001�\"]"  // <------- No good!
      		"rejectedPlans" : [ ]
      	"ok" : 1

      Here we create an index {a: 1} using the "es" (Spanish) collation. Then we run a query with a collation-aware string predicate that is eligible to use this index. The explain dumps the execution plan, which involves an IXSCAN stage. Importantly, the "A" value against which the query is matching was translated to its corresponding ICU collation key when the query planner constructed the index bounds. These bounds are then reported in the explain output as invalid UTF-8 (hence the presence of the unicode replacement character U+FFFD).

      Clients may have different defaults for how they handle invalid UTF-8 when they pull BSON data off the wire. Some clients may be strict, performing UTF-8 validation and will produce an error/exception if there is a ICU collation key which forms invalid UTF-8. Other clients may tolerate invalid UTF-8 or at least allow configuration between strict/loose modes when it comes to UTF-8 validation. In any case, the server really shouldn't generate invalid UTF-8 and send it to a client. (It would be another story if the application were storing UTF-8 and then subsequently attempting to retrieve it, but that is not the scenario here.)

      In order to fix this, we should probably change the explain logic to encode collation keys as BinData, or potentially hex-encode the string, prior to transmission over the wire. This is slightly misleading because we will not be reporting the actual index bounds, but something slightly different. However, it's still wise to do this so that the server is not sending invalid BSON strings to clients.


          Issue Links



              andrii.dobroshynski Andrii Dobroshynski
              david.storch David Storch
              0 Vote for this issue
              9 Start watching this issue