Details
Improvement
Status: Closed
Major - P3
Resolution: Done
None
None
Major Change
Platform 6 07/17/15, Platform 8 08/28/15, Platform 7 08/10/15
Description
The text search tokenizer considers only the characters \\\f\v\t\r\n\'~`!@#$%^&*(-=+[]{}|;:"<>,. /? as token delimiters. As a result, a word adjacent to a curly quote (or an em dash, etc.) is not recognized as a stopword or indexed under the correct term. The set of recognized delimiters needs to be expanded and made Unicode-aware.
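To illustrate the problem, here is a minimal sketch in plain JavaScript, not the server's actual tokenizer. The names ASCII_DELIMITERS, UNICODE_DELIMITER, and tokenize are hypothetical, the ASCII set is copied approximately from the list above, and the Unicode-aware variant simply assumes that any punctuation, symbol, separator, or control character counts as a delimiter.

```javascript
// Hypothetical sketch: compares an ASCII-only delimiter set with a
// Unicode-aware one. This is NOT the server's tokenizer implementation.

// Approximately the delimiter characters listed in the ticket.
const ASCII_DELIMITERS = new Set("\\\f\v\t\r\n'~`!@#$%^&*(-=+[]{}|;:\"<>,. /?");

// Assumption: treat Unicode punctuation (P), symbols (S), separators (Z),
// and control characters (Cc) as delimiters.
const UNICODE_DELIMITER = /[\p{P}\p{S}\p{Z}\p{Cc}]/u;

const isAsciiDelimiter = (ch) => ASCII_DELIMITERS.has(ch);
const isUnicodeDelimiter = (ch) => UNICODE_DELIMITER.test(ch);

// Split text into tokens, dropping every character the predicate
// reports as a delimiter. Iterates by code point, not UTF-16 unit.
function tokenize(text, isDelimiter) {
  const tokens = [];
  let current = "";
  for (const ch of text) {
    if (isDelimiter(ch)) {
      if (current) tokens.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  if (current) tokens.push(current);
  return tokens;
}

const sample = "\u201CSon of man, there were two women";

// The curly quote (U+201C) is not in the ASCII set, so it sticks to "Son".
console.log(tokenize(sample, isAsciiDelimiter)[0]);   // prints "“Son"

// The Unicode-aware predicate strips it, leaving the bare word.
console.log(tokenize(sample, isUnicodeDelimiter)[0]); // prints "Son"
```

With the ASCII-only set, the first token is indexed as "“Son" rather than "Son", which is why the term fails to match in the reproduction below.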
Original ticket description:
I find that directional quotes can affect the ranking of search terms in the $text search.
This is a problem because our ranking is supposed to look at words, and should ignore symbols.
Steps to reproduce:
I have the following document*:
{
    "_id" : {
        "chapter" : "23",
        "bookname" : "Ezekiel",
        "verse" : "2"
    },
    "text" : "“Son of man, there were two women who were daughters of the same mother."
}
(note that the text character that appears at the beginning of the string in the "text" field is not a standard quote I get from shift-")
When I run the following query:
> db.bible.runCommand("text", { search : "Israel \"Son of man\""} )
the text gets a score of "score" : 0.5833333333333334
But when I remove that weird quote and re-insert the document under a different _id, it gets a score of "score" : 1.1666666666666667
I would expect them both to get the same score, and for that score to be greater than 1, since it matches one of my search strings.
* Sorry for the heavily Christian text in the example, I found a random bible verse generating JSON API and just ran with it today.
Issue Links
- depends on: SERVER-19557 Create Text Index v3 (Closed)
- is documented by: DOCS-9557 Docs for SERVER-13099: Expand set of delimiters recognized by text search tokenizer (Closed)
- is related to: SERVER-17084 Full text search does not find text in typographic quotes (Closed)