  1. Core Server
  2. SERVER-8428

Text search tokens need to be Unicode normalized

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • None
    • Affects Version/s: 2.3.2
    • Component/s: Text Search
    • Query
    • ALL

      For example, insert "café" twice, once with the precomposed character U+00E9 and once with "e" followed by the combining acute accent U+0301:

      >>> import pymongo
      >>> testdb = pymongo.MongoClient()['test']
      >>> doc1 = {'_id':1, 'content':u'caf\xe9'}
      >>> doc2 = {'_id':2, 'content':u'cafe\u0301'}
      >>> testdb.foo.insert(doc1)
      1
      >>> testdb.foo.insert(doc2)
      2
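
The two inserted strings render identically but are different code-point sequences. The following is a minimal sketch using only Python 3's standard `unicodedata` module, with no MongoDB involved; the helper name `code_points` is made up for illustration. It shows the difference and that NFC normalization makes the two forms equal:

```python
import unicodedata

precomposed = "caf\u00e9"   # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT


def code_points(s):
    """Return the code points of s as 'U+XXXX' strings (illustrative helper)."""
    return ["U+%04X" % ord(ch) for ch in s]


print(code_points(precomposed))
print(code_points(decomposed))

# The raw strings differ, so a byte-wise or code-point-wise comparison fails.
print(precomposed == decomposed)

# After NFC normalization the decomposed form composes to U+00E9.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```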
      

      But a text search for one form does not return the document containing the other:

      > db.foo.ensureIndex({content:"text"})
      > db.foo.find()
      { "_id" : 1, "content" : "café" }
      { "_id" : 2, "content" : "café" }
      > s = db.foo.findOne({_id:1}).content
      café
      > db.foo.runCommand("text",{search:s})
      {
      	"queryDebugString" : "café||||||",
      	"language" : "english",
      	"results" : [
      		{
      			"score" : 1.1,
      			"obj" : {
      				"_id" : 1,
      				"content" : "café"
      			}
      		}
      	],
      	"stats" : {
      		"nscanned" : 1,
      		"nscannedObjects" : 0,
      		"n" : 1,
      		"nfound" : 1,
      		"timeMicros" : 104
      	},
      	"ok" : 1
      }
      > 
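
The behavior requested here can be sketched with a toy in-memory index. This is not MongoDB's implementation: `normalize_token`, `build_index`, and `search` are hypothetical names, whitespace tokenization stands in for the real tokenizer, and NFC is an assumed choice of normalization form (the report does not specify one). The point is only that normalizing tokens at both index time and query time lets either form of "café" match both documents:

```python
import unicodedata


def normalize_token(token, form="NFC"):
    """Normalize a token to a single Unicode form (NFC is an assumption)."""
    return unicodedata.normalize(form, token)


def build_index(docs):
    """Map each normalized token to the set of _id values containing it."""
    index = {}
    for doc in docs:
        for token in doc["content"].split():
            index.setdefault(normalize_token(token), set()).add(doc["_id"])
    return index


def search(index, query):
    """Return sorted _id values matching any normalized query token."""
    ids = set()
    for token in query.split():
        ids |= index.get(normalize_token(token), set())
    return sorted(ids)


docs = [
    {"_id": 1, "content": "caf\u00e9"},    # precomposed
    {"_id": 2, "content": "cafe\u0301"},   # combining accent
]
index = build_index(docs)

print(search(index, "caf\u00e9"))
print(search(index, "cafe\u0301"))
```

With normalization applied on both sides, the two documents share one index key, so a query in either form finds both.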
      

            Assignee:
            backlog-server-query Backlog - Query Team (Inactive)
            Reporter:
            rassi J Rassi
            Votes:
            2
            Watchers:
            4
