Type: Improvement
Resolution: Done
Priority: Major - P3
Fix Version/s: 3.1.7
Affects Version/s: None
Component/s: Text Search
Labels: None

Backwards Compatibility: Major Change
Sprint: Platform 6 07/17/15, Platform 8 08/28/15, Platform 7 08/10/15
Confidence Status: None
Work Order: 3
CAR Domain/s: None

Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

The text search tokenizer treats only the following characters as token delimiters: \\\f\v\t\r\n\'~`!@#$%^&*(-=+[]{}|;:"<>,. /? As a result, a word adjacent to a curly quote (or an em dash, etc.) is not recognized as a stopword and is not indexed under the correct term. The set of recognized delimiters needs to be expanded and made Unicode-aware.
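The difference can be illustrated with a minimal sketch. This is not the server's tokenizer: the function names tokenizeAscii and tokenizeUnicode are hypothetical, the ASCII set is copied from the list above, and the Unicode variant simply assumes that any punctuation, symbol, separator, or control character should act as a delimiter.

```javascript
// Hypothetical illustration only; not the server's actual tokenizer.
// ASCII delimiter set copied from the list in this ticket.
const ASCII_DELIMS = new Set("\\\f\v\t\r\n'~`!@#$%^&*(-=+[]{}|;:\"<>,. /?");

// Splits on the ASCII delimiters only, then lowercases each token.
function tokenizeAscii(text) {
  const tokens = [];
  let current = "";
  for (const ch of text) {
    if (ASCII_DELIMS.has(ch)) {
      if (current) tokens.push(current.toLowerCase());
      current = "";
    } else {
      current += ch;
    }
  }
  if (current) tokens.push(current.toLowerCase());
  return tokens;
}

// Assumed Unicode-aware variant: splits on any character in the
// Punctuation, Symbol, Separator, or Other (control/format) categories.
function tokenizeUnicode(text) {
  return text
    .split(/[\p{P}\p{S}\p{Z}\p{C}]+/u)
    .filter((t) => t.length > 0)
    .map((t) => t.toLowerCase());
}

const sample = "\u201CSon of man, there were two women";
console.log(tokenizeAscii(sample));   // first token keeps the curly quote
console.log(tokenizeUnicode(sample)); // first token is the bare word
```

With the ASCII-only set, the leading curly quote stays attached to the first word, so that word is stored as a different term from the one a query for "son" would look up.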

Original ticket description:

I find that directional quotes can affect the ranking of search terms in the $text search.

This is a problem because our ranking is supposed to look at words, and should ignore symbols.

Steps to reproduce:

I have the following document*:
{
  "_id" : {
    "chapter" : "23",
    "bookname" : "Ezekiel",
    "verse" : "2"
    },
  "text" : "“Son of man, there were two women who were daughters of the same mother."
}
(Note that the character at the beginning of the string in the "text" field is a directional (curly) opening quote, not the standard straight quote typed with Shift-".)

When I run the following query:
> db.bible.runCommand("text", { search : "Israel \"Son of man\""} )
the text gets a score of "score" : 0.5833333333333334

But when I remove that weird quote and re-insert the document under a different _id, it gets a score of "score" : 1.1666666666666667

I would expect them both to get the same score, and for that score to be greater than 1, since it matches one of my search strings.

Sorry for the heavily Christian text in the example, I found a random bible verse generating JSON API and just ran with it today.

depends on: SERVER-19557 Create Text Index v3 (Closed)

is related to: SERVER-17084 Full text search does not find text in typographic quotes (Closed)

Assignee: Adam Chelminski (Inactive)
Reporter: William Cross (Inactive)
Participants: Adam Chelminski, Steve Renaker, Tobias Pfeiffer, William Cross
Votes: 0
Watchers: 9

Created: Mar 07 2014 08:02:10 PM UTC
Updated: Dec 08 2016 11:41:12 PM UTC
Resolved: Aug 11 2015 09:21:05 PM UTC
Confidence Status Last Update: 17/Jul/15 12:04 AM
