Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.3.0
Affects Version/s: None
Component/s: None
Labels:
- indexv3

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Sprint:
QE 2021-11-01, QE 2021-11-15, QE 2021-11-29, QE 2021-12-13, QE 2021-12-27, QE 2022-01-10, QE 2022-02-07, QE 2022-02-21, QE 2022-01-24
Case:
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

MongoDB uses the International Components for Unicode (ICU) library to implement collation. A collation may be associated with an index. In this case, any string values in the index are not stored raw, but rather using a binary encoding from ICU. These so-called "collation keys" are binary data which can be compared with one other using a simple binary comparison. In other words, the encoding of the collation key captures any language-specific sorting semantics. By storing these keys in the index, we ensure that the index is sorted according to language-specific sorting rules.

Unfortunately, these collation keys internally use BSON type 2, the type for UTF-8 strings, rather than type 5 (BinData). The ICU collation keys are also not guaranteed to be valid UTF-8, and they are routinely invalid when interpreted as UTF-8 in practice. This is all fine if the ICU collation keys stored inside BSON strings are only manipulated internally, however we have seen various scenarios in the past where we accidentally transmit them over the wire or write them to a log file: see related tickets ~~SERVER-46810~~, SERVER-26050, and ~~SERVER-50454~~.

This ticket tracks a similar scenario involving explain. Here is a simple example repro:

MongoDB Enterprise > db.c.drop()
MongoDB Enterprise > db.c.createIndex({a: 1}, {collation: {locale: "es"}})
MongoDB Enterprise > db.c.find({a: "A"}).collation({locale: "es"}).explain()
{
	"explainVersion" : "1",
	"queryPlanner" : {
		"namespace" : "test.c",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"a" : {
				"$eq" : "A"
			}
		},
		"collation" : {
			"locale" : "es",
			"caseLevel" : false,
			"caseFirst" : "off",
			"strength" : 3,
			"numericOrdering" : false,
			"alternate" : "non-ignorable",
			"maxVariable" : "punct",
			"normalization" : false,
			"backwards" : false,
			"version" : "57.1"
		},
		"queryHash" : "B44D1882",
		"planCacheKey" : "251DE71F",
		"maxIndexedOrSolutionsReached" : false,
		"maxIndexedAndSolutionsReached" : false,
		"maxScansToExplodeReached" : false,
		"winningPlan" : {
			"stage" : "FETCH",
			"inputStage" : {
				"stage" : "IXSCAN",
				"keyPattern" : {
					"a" : 1
				},
				"indexName" : "a_1",
				"collation" : {
					"locale" : "es",
					"caseLevel" : false,
					"caseFirst" : "off",
					"strength" : 3,
					"numericOrdering" : false,
					"alternate" : "non-ignorable",
					"maxVariable" : "punct",
					"normalization" : false,
					"backwards" : false,
					"version" : "57.1"
				},
				"isMultiKey" : false,
				"multiKeyPaths" : {
					"a" : [ ]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"a" : [
						"[\")\u0001\u0005\u0001�\", \")\u0001\u0005\u0001�\"]"  // <------- No good!
					]
				}
			}
		},
		"rejectedPlans" : [ ]
	},
...
	"ok" : 1
}

Here we create an index {a: 1} using the "es" (Spanish) collation. Then we run a query with a collation-aware string predicate that is eligible to use this index. The explain dumps the execution plan, which involves an IXSCAN stage. Importantly, the "A" value against which the query is matching was translated to its corresponding ICU collation key when the query planner constructed the index bounds. These bounds are then reported in the explain output as invalid UTF-8 (hence the presence of the unicode replacement character U+FFFD).

Clients may have different defaults for how they handle invalid UTF-8 when they pull BSON data off the wire. Some clients may be strict, performing UTF-8 validation and will produce an error/exception if there is a ICU collation key which forms invalid UTF-8. Other clients may tolerate invalid UTF-8 or at least allow configuration between strict/loose modes when it comes to UTF-8 validation. In any case, the server really shouldn't generate invalid UTF-8 and send it to a client. (It would be another story if the application were storing UTF-8 and then subsequently attempting to retrieve it, but that is not the scenario here.)

In order to fix this, we should probably change the explain logic to encode collation keys as BinData, or potentially hex-encode the string, prior to transmission over the wire. This is slightly misleading because we will not be reporting the actual index bounds, but something slightly different. However, it's still wise to do this so that the server is not sending invalid BSON strings to clients.

is related to

SERVER-46810 Broken E11000 duplicate key error when unique index contains collation

Closed

SERVER-26050 Unique key violation for index with a non-simple collation has unclear error message

Backlog

SERVER-50454 Avoiding sending the "keyValue" field to drivers on duplicate key error

Closed

Assignee:: Andrii Dobroshynski (Inactive)
Reporter:: David Storch
Participants:: Andrii Dobroshynski, David Storch, Githook User
Votes:: 0 Vote for this issue
Watchers:: 13 Start watching this issue

Created:: Sep 28 2021 10:16:07 PM UTC
Updated:: Oct 29 2023 09:48:04 PM UTC
Resolved:: Feb 05 2022 07:02:03 PM UTC
Confidence Status Last Update:: 20/Oct/21 2:25 PM

Details

Description

Attachments

Issue Links

Forms

Activity

People

Dates