[SERVER-27392] text index with multi-key prefix is not rejected Created: 13/Dec/16  Updated: 17/Jan/24  Resolved: 01/Feb/17

Status: Closed
Project: Core Server
Component/s: Text Search
Affects Version/s: 3.4.0
Fix Version/s: 3.5.3

Type: Bug Priority: Major - P3
Reporter: Cyril Auburtin Assignee: David Storch
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-34168 Compound textual index with nested fi... Closed
Related
related to SERVER-84114 Exceptions thrown while generating in... Open
is related to SERVER-32720 Field xxx of text index contains an a... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:
Issue summary for SERVER-27392

Issue Description:

In certain situations, compound text indexes in MongoDB 3.4 may have incorrectly indexed documents where either non-text prefix or suffix fields are multikey. Documents containing an array in fields referenced by a compound text index will not be returned by queries which use this index.

Expected Behavior: Documents containing multikey values for the prefix or suffix fields should have been rejected by the index, causing the index build to fail or rejecting certain inserts to the collection.

Indexes Affected

This only affects compound text indexes where the prefix or suffix fields are non-text.

For example:

collection.createIndex({ "a.b" : 1, "c" : "text" })

Other compound text indexes are unaffected.

Example:

// Only occurrs in MongoDB 3.4
db.foo.drop()
db.foo.createIndex({ "a.b" : 1, "a.c" : "text" })
 
// Should be rejected due to multikey index prefix but is accepted
db.foo.insert({ "a" : [ { "b" : 1, "c": "foo" }, { "b" : 2, "c" : "bar" } ] })
 
// As a result, the indexed value for 
db.foo.find({ "a.b" : 1, "$text" : { "$search" : "foo" } })
// nothing is returned

Affected Versions

  • 3.4.0 - 3.4.24 - Compound text indexes allowed invalid index keys.
  • 3.6.0 and later - Compound text indexes no longer allow invalid index keys. However, indexes already containing invalid index keys would continue to be affected until those index keys were removed or the index was rebuilt. See the Remediation section below.

Diagnosis

Run the validate() command on collections with compound text indexes. The command will return the following error if impacted:

  • text index contains an array in document

Remediation

If validate() reports that a "text index contains an array", this issue can be remediated by re-building the affected indexes after removing or updating any documents containing data that would be incompatible with the index. The following steps may help to guide this effort.

1. Use $type to query the collection for documents containing multi-key values in the index prefix. For example, for the index { "a.b" : 1, "c" : "text" }, issuing the following query may identify all documents with multi-key values for "a.b":

	db.collection.find({ "$or" : [ { "a": { "$type" : 4 } }, { "a.b" : { "$type" : 4 } } ] } ) 

You may also want to consider creating partial indexes to improve the performance of these operations. For example:

	db.collection.createIndex({ "a" : 1 }, { "partialFilterExpression" : { "a" : { "$type" : 4 } } } ) 
	db.collection.createIndex({ "a.b" : 1 }, { "partialFilterExpression" : { "a" : { "$type" : 4 } } } )

2. Update or remove these documents to remove the invalid multi-key values from the collection.

3. Drop and rebuild the compound text index.


Previous description:

> db.a.find()
> db.a.createIndex({'x.name':'text'})
{
        "createdCollectionAutomatically" : true,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1
}
> db.a.insert({i:1, x:[{name:'pet', language:'en'}, {name:'animal domestique', language:'fr'}]})
WriteResult({ "nInserted" : 1 })
> db.a.insert({i:1, x:[{name:'fart', language:'en'}, {name:'pet', language:'fr'}]})
WriteResult({ "nInserted" : 1 })
> db.a.insert({i:1, x:[{name:'wax', language:'en'}, {name:'farter', language:'fr'}]})
WriteResult({ "nInserted" : 1 })
> db.a.find({$text:{$search:'fart', $language:'fr'}, 'x.language':'fr'})
{ "_id" : ObjectId("584868134fb941395745c6bb"), "i" : 1, "x" : [ { "name" : "wax", "language" : "en" }, { "name" : "farter", "language" : "fr" } ] }
{ "_id" : ObjectId("584867ec4fb941395745c6ba"), "i" : 1, "x" : [ { "name" : "fart", "language" : "en" }, { "name" : "pet", "language" : "fr" } ] }
> db.a.dropIndexes()
> db.a.createIndex({'x.language':1, 'x.name': 'text'})
> db.a.find({'x.language':'fr', $text:{$search: 'fart', $language:'fr'}})
> db.a.find({'x.language':'fr', $text:{$search: 'fart', $language:'en'}})
> db.a.find({'x.language':'en', $text:{$search: 'fart', $language:'en'}})
// empty results

Sprint: Query 2017-01-23, Query 2017-02-13
Participants:

 Description   

When doing a AND query between text search and another field, the other field seems ignored

Below, there should be only one result



 Comments   
Comment by Githook User [ 01/Feb/17 ]

Author:

{u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}

Message: SERVER-27392 fix bug in FTS index's check for indexed arrays
Branch: master
https://github.com/mongodb/mongo/commit/521a57162820b3b1100d20408f5ad0cd89fdc80e

Comment by Cyril Auburtin [ 13/Dec/16 ]

Thanks Thomas, great answers,

This way of embedding multiple translations in the same document is really practical, and efficient, but you can't avoid this (rare) case of false-friends matches like shown in first example. It would be great if Text search could give which language matched and be able to $and/filter it

I could store simple documents (not nested), then I can avoid false-friends, but it's not really good for performance I think, because you need to link those small multilanguage documents to the main document, then do _id: {$in: allIdsFound} + other queries in the main collection, without being able to .limit or .skip for finding this ids, and then having to possibly repopulated after...

Anyway, this is getting far from the issue to raised, just wanting to explain the context

Comment by Kelsey Schubert [ 13/Dec/16 ]

Hi caub,

Thank you for the report. The behavior you are observing with the first index is expected.

$language:'fr' changes the tokenizer on the text search, and x.language':'fr' matches against any subdocument in the array containing this equality. Consequently it is is expected that

{ "_id" : ObjectId("584867ec4fb941395745c6ba"), "i" : 1, "x" : [ { "name" : "fart", "language" : "en" }, { "name" : "pet", "language" : "fr" } ] }

is returned since the first document in the array contains the name fart and the second document in the array has language set to "fr". For more information, please review our documentation.

The second example you have presented is a bug, and I will be modifying this ticket to highlight this issue. The index

{'x.language':1, 'x.name': 'text'}

does not support multikey values on x.language, and MongoDB should reject the index in this case. See the example below for its expected behavior:

> db.c.insert({a:[1.2],b:'a'})
WriteResult({ "nInserted" : 1 })
> db.c.createIndex({'a':1,'b':'text'})
{
	"ok" : 0,
	"errmsg" : "cannot have a multi-key as a prefix to a text index",
	"code" : 16675,
	"codeName" : "Location16675"
}

Please continue to watch for updates regarding the fix for this issue.

Kind regards,
Kelsey

Comment by Cyril Auburtin [ 13/Dec/16 ]

I've also tried:

> db.a.dropIndexes()
> db.a.createIndex({'x.language':1, 'x.name': 'text'})
> db.a.find({'x.language':'fr', $text:{$search: 'fart', $language:'fr'}})
> db.a.find({'x.language':'fr', $text:{$search: 'fart', $language:'en'}})
> db.a.find({'x.language':'en', $text:{$search: 'fart', $language:'en'}})
// empty results

Comment by Cyril Auburtin [ 13/Dec/16 ]

steps to reproduce:

> db.a.find()
> db.a.createIndex({'x.name':'text'})
{
        "createdCollectionAutomatically" : true,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1
}
> db.a.insert({i:1, x:[{name:'pet', language:'en'}, {name:'animal domestique', language:'fr'}]})
WriteResult({ "nInserted" : 1 })
> db.a.insert({i:1, x:[{name:'fart', language:'en'}, {name:'pet', language:'fr'}]})
WriteResult({ "nInserted" : 1 })
> db.a.insert({i:1, x:[{name:'wax', language:'en'}, {name:'farter', language:'fr'}]})
WriteResult({ "nInserted" : 1 })
> db.a.find({$text:{$search:'fart', $language:'fr'}, 'x.language':'fr'})
{ "_id" : ObjectId("584868134fb941395745c6bb"), "i" : 1, "x" : [ { "name" : "wax", "language" : "en" }, { "name" : "farter", "language" : "fr" } ] }
{ "_id" : ObjectId("584867ec4fb941395745c6ba"), "i" : 1, "x" : [ { "name" : "fart", "language" : "en" }, { "name" : "pet", "language" : "fr" } ] }

Generated at Thu Feb 08 04:15:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.