[SERVER-49046] Provide the ability to perform a collation aware search only on the fields on which the index is defined Created: 23/Jun/20  Updated: 16/Jul/20  Resolved: 16/Jul/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Harshad Dhavale Assignee: David Storch
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-25954 Support more granular collation speci... Backlog
Sprint: Query 2020-07-27
Participants:
Case:

 Description   

If we have an index with a certain collation defined, then a search performed using that index will use the collation not only for that field but for other fields as well.

Consider a test collection with its corresponding indexes like this:

> db.coll_insens_search_test.find()
{ "_id" : 1, "A" : "goat", "B" : "core" }
{ "_id" : 2, "A" : "goat", "B" : "CORE" }
{ "_id" : 3, "A" : "GoAt", "B" : "core" }
{ "_id" : 4, "A" : "GOAT", "B" : "CORE" }
{ "_id" : 5, "A" : "GOat", "B" : "core" }

> db.coll_insens_search_test.getIndexes()
[
	{
		"v" : 2,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_",
		"ns" : "test.coll_insens_search_test"
	},
	{
		"v" : 2,
		"key" : {
			"B" : 1
		},
		"name" : "HD_CaseInsensitiveIndex",
		"ns" : "test.coll_insens_search_test",
		"collation" : {
			"locale" : "en_US",
			"caseLevel" : false,
			"caseFirst" : "off",
			"strength" : 2,
			"numericOrdering" : false,
			"alternate" : "non-ignorable",
			"maxVariable" : "punct",
			"normalization" : false,
			"backwards" : false,
			"version" : "57.1"
		}
	}
]

The case-insensitive index above is only on field B and not on field A. However, the following query yields all 5 documents:

> db.coll_insens_search_test.find({ A: "GOat", B:"core" }).collation({ locale: "en_US", strength: 2 })
{ "_id" : 1, "A" : "goat", "B" : "core" }
{ "_id" : 2, "A" : "goat", "B" : "CORE" }
{ "_id" : 3, "A" : "GoAt", "B" : "core" }
{ "_id" : 4, "A" : "GOAT", "B" : "CORE" }
{ "_id" : 5, "A" : "GOat", "B" : "core" }

In other words, we see that the case-insensitive search is being performed not only for B on which the index is defined, but also for A which is not covered in the index (it is debatable whether this is the correct behavior). Otherwise it would've only returned the last document with "_id" : 5.

 
The winningPlan for the above query looks like this:

"winningPlan" : {
			"stage" : "FETCH",
			"filter" : {
				"A" : {
					"$eq" : "GOat"
				}
			},
			"inputStage" : {
				"stage" : "IXSCAN",
				"keyPattern" : {
					"B" : 1
				},
				"indexName" : "HD_CaseInsensitiveIndex",
				"collation" : {
					"locale" : "en_US",
					"caseLevel" : false,
					"caseFirst" : "off",
					"strength" : 2,
					"numericOrdering" : false,
					"alternate" : "non-ignorable",
					"maxVariable" : "punct",
					"normalization" : false,
					"backwards" : false,
					"version" : "57.1"
				},
				"isMultiKey" : false,
				"multiKeyPaths" : {
					"B" : [ ]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"B" : [
						"[\"-EK1\u0001\b\", \"-EK1\u0001\b\"]"
					]
				}
			}
		}

 

Without using the collation, it yields only the 5th document as expected, as shown here:

> db.coll_insens_search_test.find({ A: "GOat", B:"core" })
{ "_id" : 5, "A" : "GOat", "B" : "core" }

 

This is a feature request to provide the ability to perform a collation aware search only on the fields on which the index is defined, and not expand it to other fields which are not covered in the index.



 Comments   
Comment by David Storch [ 16/Jul/20 ]

The described behavior is working as designed. In today's MQL the collation can only be applied to the query as a whole. All string comparisons that are done to compute the query results should be collation-aware. The presence of any indexes with collations does not change this property; the set of indexes should allow the system to accelerate the query, but does not change the query's meaning.

SERVER-25954 would allow collation to be specified at a more granular level. In particular, one of the ideas tracked by this ticket is to allow each individual comparison expression ($gt, $eq, etc.) to specify its own collation. I believe per-expression collation is exactly what this ticket is asking for, so I'm closing it as a duplicate of SERVER-25954.

Comment by David Storch [ 25/Jun/20 ]

At a glance, this appears to be a duplicate of SERVER-25954.

Generated at Thu Feb 08 05:18:48 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.