[SERVER-13099] Expand set of delimiters recognized by text search tokenizer Created: 07/Mar/14  Updated: 08/Dec/16  Resolved: 11/Aug/15

Status: Closed
Project: Core Server
Component/s: Text Search
Affects Version/s: None
Fix Version/s: 3.1.7

Type: Improvement Priority: Major - P3
Reporter: William Cross Assignee: Adam Chelminski (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-19557 Create Text Index v3 Closed
Documented
is documented by DOCS-9557 Docs for SERVER-13099: Expand set of ... Closed
Related
is related to SERVER-17084 Full text search does not find text i... Closed
Backwards Compatibility: Major Change
Sprint: Platform 6 07/17/15, Platform 8 08/28/15, Platform 7 08/10/15
Participants:

 Description   

The text search tokenizer considers only the characters \\\f\v\t\r\n\'~`!@#$%^&*(-=+[]{}|;:"<>,. /? as token delimiters. As a result, a word adjacent to a curly quote (or an em dash, etc.) is neither recognized as a stop word nor indexed under the correct term. The set of recognized delimiters needs to be expanded and made Unicode-aware.
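
To illustrate the problem, here is a minimal sketch in plain JavaScript. It is not the server's tokenizer: the names ASCII_DELIMS, tokenizeAscii and tokenizeUnicode are hypothetical, the ASCII set only approximates the list above, and treating every non-letter, non-mark, non-number character as a delimiter is an assumption about what "Unicode-aware" could mean, not a description of what text index v3 does.

```javascript
// Hypothetical ASCII-only delimiter set, loosely modelled on the list in this ticket.
const ASCII_DELIMS = new Set(" \f\v\t\r\n\\'~`!@#$%^&*()-=+[]{}|;:\"<>,./?");

// Split on ASCII delimiters only; any other symbol stays glued to its word.
function tokenizeAscii(text) {
  const tokens = [];
  let current = "";
  for (const ch of text) {
    if (ASCII_DELIMS.has(ch)) {
      if (current) tokens.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  if (current) tokens.push(current);
  return tokens.map((t) => t.toLowerCase());
}

// Assumed Unicode-aware variant: anything that is not a letter, combining mark
// or number (per Unicode general categories) acts as a delimiter.
function tokenizeUnicode(text) {
  return text.toLowerCase().split(/[^\p{L}\p{M}\p{N}]+/u).filter(Boolean);
}

const sample = "\u201CSon of man, there were two women";
console.log(tokenizeAscii(sample)[0]);   // first token keeps the curly quote attached
console.log(tokenizeUnicode(sample)[0]); // first token is the bare word
```

With the ASCII-only set, the left curly quote (U+201C) stays attached to "Son", so the first token no longer matches the plain term "son". The Unicode-aware variant strips it.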

Original ticket description:

I find that directional quotes can affect the ranking of search terms in the $text search.

This is a problem because our ranking is supposed to look at words, and should ignore symbols.

Steps to reproduce:

I have the following document*:

{
  "_id" : {
    "chapter" : "23",
    "bookname" : "Ezekiel",
    "verse" : "2"
    },
  "text" : "“Son of man, there were two women who were daughters of the same mother."
}

(note that the text character that appears at the beginning of the string in the "text" field is not a standard quote I get from shift-")

When I run the following query:

> db.bible.runCommand("text", { search : "Israel \"Son of man\""} )

the text gets a score of "score" : 0.5833333333333334

But when I remove that weird quote and re-insert the document under a different _id, it gets a score of "score" : 1.1666666666666667

I would expect them both to get the same score, and for that score to be greater than 1, since it matches one of my search strings.

* Sorry for the heavily Christian text in the example, I found a random bible verse generating JSON API and just ran with it today.


 Comments   
Comment by Steve Renaker (Inactive) [ 08/Dec/16 ]

Which characters were added to the list of delimiters as a result of this ticket? I'm updating the documentation found here: https://docs.mongodb.com/manual/core/index-text/#tokenization-delimiters

Thanks!

Comment by Adam Chelminski (Inactive) [ 11/Aug/15 ]

This fix is integrated into text index v3, which will not work at all with previous versions of MongoDB.

Comment by Tobias Pfeiffer [ 02/Feb/15 ]

Is there any plan to address this issue?

Generated at Thu Feb 08 03:30:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.