  1. Core Server
  2. SERVER-13099

Expand set of delimiters recognized by text search tokenizer

    • Type: Improvement
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 3.1.7
    • Affects Version/s: None
    • Component/s: Text Search
    • Labels: None
    • Backwards Compatibility: Major Change
    • Sprint: Platform 6 07/17/15, Platform 7 08/10/15, Platform 8 08/28/15

      The text search tokenizer considers only the characters \\\f\v\t\r\n\'~`!@#$%^&*(-=+[]{}|;:"<>,. /? as token delimiters. As a result, a word adjacent to a curly quote (or an em dash, etc.) is not recognized as a stopword and is not indexed under the correct term, because the non-ASCII punctuation stays attached to the word. The set of recognized delimiters needs to be expanded and made Unicode-aware.
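      The effect described above can be sketched with a toy tokenizer. This is only an illustration, not the server's implementation: the `tokenize` function, the two predicates, and the exact ASCII set below are assumptions made for the example, and the Unicode-aware variant simply treats any punctuation, symbol, or separator character as a delimiter.

```javascript
// Illustrative sketch only; names and delimiter sets are assumptions,
// not the actual text search tokenizer.

// ASCII-only delimiter set, approximating the list quoted in this ticket.
const asciiDelimiters = new Set(" \f\v\t\r\n\\'~`!@#$%^&*()-=+[]{}|;:\"<>,./?");

// Split text into lowercase tokens on every character that
// isDelimiter() reports as a delimiter.
function tokenize(text, isDelimiter) {
  const tokens = [];
  let current = "";
  for (const ch of text) {
    if (isDelimiter(ch)) {
      if (current) tokens.push(current.toLowerCase());
      current = "";
    } else {
      current += ch;
    }
  }
  if (current) tokens.push(current.toLowerCase());
  return tokens;
}

const isAsciiDelimiter = (ch) => asciiDelimiters.has(ch);

// Assumed Unicode-aware rule: punctuation (P), symbols (S),
// separators (Z), and whitespace all end a token.
const isUnicodeDelimiter = (ch) => /[\p{P}\p{S}\p{Z}\s]/u.test(ch);

const text = "\u201CSon of man, there were two women";
const asciiTokens = tokenize(text, isAsciiDelimiter);
const unicodeTokens = tokenize(text, isUnicodeDelimiter);

console.log(asciiTokens[0]);   // "“son" (curly quote stays glued to the word)
console.log(unicodeTokens[0]); // "son"
```

      With the ASCII-only set, the first token is not the term "son", so it cannot match a search for "son" or be dropped as a stopword. With the Unicode-aware rule the quote is treated as a delimiter and the term comes out clean.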

      Original ticket description:

      I find that directional quotes can affect the ranking of search terms in the $text search.

      This is a problem because our ranking is supposed to look at words, and should ignore symbols.

      Steps to reproduce:

      I have the following document*:

        {
          "_id" : {
            "chapter" : "23",
            "bookname" : "Ezekiel",
            "verse" : "2"
          },
          "text" : "“Son of man, there were two women who were daughters of the same mother."
        }

      (Note that the character at the beginning of the string in the "text" field is a left curly quote, not the standard straight quote I get from Shift-".)
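      One way to confirm which quote character a string starts with is to print its code point. The snippet below is a small sketch; `firstCodePoint` is a hypothetical helper written for this example, and the two sample strings are shortened versions of the "text" field.

```javascript
// Hypothetical helper: report the first character of a string as U+XXXX.
function firstCodePoint(s) {
  const cp = s.codePointAt(0);
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

const curly = "\u201CSon of man";  // starts with a left curly quote
const straight = "\"Son of man";   // starts with a straight quote

console.log(firstCodePoint(curly));    // U+201C
console.log(firstCodePoint(straight)); // U+0022
```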

      When I run the following query:

      > db.bible.runCommand("text", { search : "Israel \"Son of man\""} )

      the text gets a score of "score" : 0.5833333333333334

      But when I remove that weird quote and re-insert the document under a different _id, it gets a score of "score" : 1.1666666666666667

      I would expect them both to get the same score, and for that score to be greater than 1, since it matches one of my search strings.

      * Sorry for the heavily Christian text in the example; I found a random Bible-verse-generating JSON API and just ran with it today.

            adam.chelminski@mongodb.com Adam Chelminski (Inactive)
            william.cross William Cross (Inactive)
