[SERVER-24832] Project array by text search predicate Created: 28/Jun/16  Updated: 27/Dec/23

Status: Backlog
Project: Core Server
Component/s: Querying, Text Search
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Cássio Martini Martins Pereira Assignee: Backlog - Query Integration
Resolution: Unresolved Votes: 0
Labels: qi-text-search
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-17648 Implement matcher for text predicates Backlog
Assigned Teams:
Query Integration
Participants:

 Description   

Suppose I create the following db/collection:

> use articles
> db.stores.insert( [ 
             {_id : 1, arr : ['abc xyz', 'def']}, 
             {_id : 2, arr : ['jadskf', 'ljh abc']}])
> db.stores.createIndex({"arr" : "text"})

I want to do text search over the arrays arr and obtain only the array elements that match the search.

For instance,

> db.stores.find({$text : {$search : "abc"}})
{ "_id" : 1, "arr" : [ "abc xyz", "def" ] }
{ "_id" : 2, "arr" : [ "jadskf", "ljh abc" ] }

Ideally I'd like to only obtain the first element of the arr in the first document: abc xyz and only the second element of the arr in the second document: ljh abc, and also the _id of the matched documents.

Note that what I want is not a simple projection of the arr element, but of the elements of the arr where the match occurred.



 Comments   
Comment by Cássio Martini Martins Pereira [ 05/Jul/16 ]

Thanks for the update ramon.fernandez. Will do.

Comment by Ramon Fernandez Marina [ 05/Jul/16 ]

Thanks for the additional information. The Query team has added this to their backlog for consideration in the future. Please continue to watch the ticket for updates.

Comment by Cássio Martini Martins Pereira [ 29/Jun/16 ]

Hi, thanks for looking into it ramon.fernandez.

The problem with regexes is that you're going to have to hardcode them.

For instance, suppose the following new example:

use articles
db.stores.insert( [{_id : 1, arr : ['i watched a movie last night', 'def']}, {_id : 2, arr : ['i went fishing last thursday', 'ljh abc']}])
db.stores.createIndex({"arr" : "text"})

FTS can find the entries with the following queries:

> db.stores.find({$text : {$search : "fish"}})
{ "_id" : 2, "arr" : [ "i went fishing last thursday", "ljh abc" ] }
> db.stores.find({$text : {$search : "watch"}})
{ "_id" : 1, "arr" : [ "i watched a movie last night", "def" ] }

How would you write a generic regular expression that handles both cases?

Notice that for FTS stemming is going on on the background.

Hardcoding fish* and watch* is not allowed. Imagine the input is coming from a search box for a user.

Comment by Ramon Fernandez Marina [ 29/Jun/16 ]

I am sending this feature request to the Query team for evaluation. Note that this feature introduces new semantics for multikey indices, so even if it's accepted it may require a considerable effort. I think it would help your case if you were able to produce an example where using regular expressions is not sufficient.

Thanks,
Ramón.

Comment by Cássio Martini Martins Pereira [ 29/Jun/16 ]

Hi ramon.fernandez,

That is a possibility.

I also got a nice "trick" from a stackoverflow question I posted: http://stackoverflow.com/questions/38085710/return-only-the-array-element-on-which-a-text-search-matched

db.stores.find({
    $text : {
        $search : "abc"
    }
}, {
    arr : {
        $elemMatch : {
            $regex : /abc/i
        }
    }
})

But your proposed solution and the one I got from stack overflow (listed above) use regexs. While that does solve my simple toy case, regexs do not have the same semantics of full text search in Mongo. FTS uses stop words, stemming, which are not present in regexes.

If you can provide me with an example that solves the problem using the aggregation framework (or map reduce) while using FTS I'd be happy the functionality already exists.

PS: Thanks for the quick response!

Comment by Ramon Fernandez Marina [ 29/Jun/16 ]

cmmp, perhaps the aggregation framework can meet your needs:

> db.foo.find()
{ "_id" : 1, "arr" : [ "abc xyz", "def" ] }
{ "_id" : 2, "arr" : [ "jadskf", "ljh abc" ] }
{ "_id" : 3, "arr" : [ "foo1", "foo2 abc", "test", "abc foo3" ] }
>
> db.foo.aggregate([{$unwind : "$arr"}, {$match : {arr : /abc/}}, {$group : {_id: '$_id', arr: {$push: "$arr"}}}, {$sort : {_id : 1}}])
{ "_id" : 1, "arr" : [ "abc xyz" ] }
{ "_id" : 2, "arr" : [ "ljh abc" ] }
{ "_id" : 3, "arr" : [ "foo2 abc", "abc foo3" ] }

Is this the functionality you're looking for?

Generated at Thu Feb 08 04:07:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.