Core Server / SERVER-13099

Expand set of delimiters recognized by text search tokenizer

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1.7
    • Component/s: Text Search
    • Labels:
      None
    • Backwards Compatibility:
      Major Change
    • Sprint:
      Platform 6 07/17/15, Platform 8 08/28/15, Platform 7 08/10/15

      Description

      The text search tokenizer treats only the characters \\\f\v\t\r\n\'~`!@#$%^&*(-=+[]{}|;:"<>,. /? as token delimiters. As a result, a word adjacent to a curly quote (or an em dash, etc.) is not recognized as a stop word and is not indexed under the correct term. The set of recognized delimiters needs to be expanded and made Unicode-aware.
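The effect described above can be sketched in a few lines of JavaScript. This is a hypothetical illustration only, not the server's tokenizer: `ASCII_DELIMS` is an approximation of the delimiter list quoted above, `UNICODE_DELIMS` is one assumed way of being Unicode-aware (split on anything that is not a letter or number), and `tokenize` is a made-up helper name.

```javascript
// Hypothetical sketch; not the actual MongoDB tokenizer.
// ASCII_DELIMS approximates the ASCII-only delimiter set from the ticket.
const ASCII_DELIMS = /[\s'~`!@#$%^&*()\-=+\[\]{}|;:"<>,.\/?\\]+/;

// Assumed Unicode-aware alternative: any run of characters that are
// neither letters nor numbers acts as a delimiter.
const UNICODE_DELIMS = /[^\p{L}\p{N}]+/u;

// Lowercase the text, split on the delimiter pattern, drop empty tokens.
function tokenize(text, delims) {
  return text
    .toLowerCase()
    .split(delims)
    .filter((token) => token.length > 0);
}

// \u201C is the left curly quote from the reporter's document.
const sample = "\u201CSon of man, there were two women";

const asciiTokens = tokenize(sample, ASCII_DELIMS);
const unicodeTokens = tokenize(sample, UNICODE_DELIMS);

console.log(asciiTokens[0]);   // "“son" (the quote stays glued to the word)
console.log(unicodeTokens[0]); // "son"
```

With the ASCII-only set, the first token keeps the curly quote attached, so it no longer equals the term "son" that a query would look up. The Unicode-aware pattern yields the plain word.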

      Original ticket description:

      I find that directional quotes can affect the ranking of search terms in the $text search.

      This is a problem because our ranking is supposed to look at words, and should ignore symbols.

      Steps to reproduce:

      I have the following document*:

      {
        "_id" : {
          "chapter" : "23",
          "bookname" : "Ezekiel",
          "verse" : "2"
          },
        "text" : "“Son of man, there were two women who were daughters of the same mother."
      }

      (Note that the character at the beginning of the string in the "text" field is not the standard quote I get from shift-".)

      When I run the following query:

      > db.bible.runCommand("text", { search : "Israel \"Son of man\""} )

      the text gets a score of "score" : 0.5833333333333334

      But when I remove that weird quote and re-insert the document under a different _id, it gets a score of "score" : 1.1666666666666667

      I would expect them both to get the same score, and for that score to be greater than 1, since it matches one of my search strings.

      * Sorry for the heavily Christian text in the example; I found a random Bible-verse-generating JSON API and just ran with it today.

    People

    • Votes: 0
    • Watchers: 10
