  1. Core Server
  2. SERVER-17535

FTS doesn't match full words when doing filtering on the set found from index scan

    • Query Integration

      The following behavior with FTS seems inconsistent:

      sputnik-rs:PRIMARY> db.ftstest.insert( { data1 : "abcd", data2 : "efgh", data3 : "ijkl" } )
      WriteResult({ "nInserted" : 1 })
      sputnik-rs:PRIMARY> db.ftstest.createIndex( { data1 : "text", data2 : "text", data3 : "text" } )
      {
      	"createdCollectionAutomatically" : false,
      	"numIndexesBefore" : 1,
      	"numIndexesAfter" : 2,
      	"ok" : 1
      }
      sputnik-rs:PRIMARY> db.ftstest.find( { $text : { $search : '"abcd" "efgh" "ijkl"' } } )
      { "_id" : ObjectId("54fff32bb43016d00d95734a"), "data1" : "abcd", "data2" : "efgh", "data3" : "ijkl" }
      sputnik-rs:PRIMARY> db.ftstest.find( { $text : { $search : '"abc" "efg" "ijk"' } } )
      sputnik-rs:PRIMARY> db.ftstest.find( { $text : { $search : '"abcd" "efg" "ijk"' } } )
      { "_id" : ObjectId("54fff32bb43016d00d95734a"), "data1" : "abcd", "data2" : "efgh", "data3" : "ijkl" }
      sputnik-rs:PRIMARY> db.ftstest.find( { $text : { $search : '"bcd" "fgh" "jkl"' } } )
      sputnik-rs:PRIMARY> db.ftstest.find( { $text : { $search : '"abcd" "fgh" "jkl"' } } )
      { "_id" : ObjectId("54fff32bb43016d00d95734a"), "data1" : "abcd", "data2" : "efgh", "data3" : "ijkl" }
      sputnik-rs:PRIMARY>
      

      What happens above:

      1. In the first query, all search words match and the document is returned. This is expected.
      2. In the second query, we removed the last letter of each word. As a result they no longer match the full words and nothing is returned. This is expected.
      3. In the third query, the first word is again the full four-letter word, but the other two are only parts of words. Since this is an AND search, it should return an empty set, yet it returns the document because the second and third search words match part of the words in the document.
      4. In the fourth and fifth queries, the same is demonstrated when removing the first letter of all or some of the words, respectively.
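
      The pattern in the five queries can be reproduced with a toy model. This is only a sketch of the hypothesis in this report, not MongoDB's code: the names indexScan, phraseFilter and search are made up for illustration, stemming is ignored, and it assumes that a document becomes a candidate when any search word is an indexed whole word, after which each quoted phrase is checked by substring.

```javascript
// Toy model of the suspected two-stage behaviour. Not MongoDB code:
// indexScan, phraseFilter and search are hypothetical names.
const doc = { data1: "abcd", data2: "efgh", data3: "ijkl" };

// The "index" holds whole words only (stemming is ignored here).
const indexedTerms = new Set(Object.values(doc));

// Stage 1 (assumed): the document is a candidate if ANY search word
// equals an indexed whole word.
function indexScan(terms) {
  return terms.some((t) => indexedTerms.has(t));
}

// Stage 2 (assumed): every quoted phrase must occur in the text,
// but the check is a plain substring test.
function phraseFilter(terms) {
  const text = Object.values(doc).join(" ");
  return terms.every((t) => text.includes(t));
}

function search(terms) {
  return indexScan(terms) && phraseFilter(terms);
}

// The five queries from the transcript, in order.
const results = [
  ["abcd", "efgh", "ijkl"],
  ["abc", "efg", "ijk"],
  ["abcd", "efg", "ijk"],
  ["bcd", "fgh", "jkl"],
  ["abcd", "fgh", "jkl"],
].map(search);

console.log(results); // [ true, false, true, false, true ]
```

      Under these assumptions the model returns the document for the first, third and fifth queries and nothing for the second and fourth, which is the same pattern as in the shell session.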

      It seems to me that when scanning the index, MongoDB matches full words (post-stemming), which is expected. However, the documents found by the index scan then go through a filtering step, and that step matches parts of words.

      Without having looked at the code, I recognize this as a common error when using regular expression libraries. To match full words, the syntax ^abcd$ should be used, but a developer may easily forget that and just search for abcd, which matches any string that contains abcd as a part of itself.
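
      To illustrate the regular expression point with plain JavaScript (which the mongo shell also runs): the sketch below compares an unanchored pattern, the anchored form mentioned above, and a word-boundary form. The variable names are illustrative only, and whether the server's filter uses regular expressions at all is an assumption of this report.

```javascript
// Three ways to look for the search word "efg".
const partial = /efg/;         // matches anywhere, including inside "efgh"
const anchored = /^efg$/;      // matches only if the whole string is "efg"
const wordBounded = /\befg\b/; // matches "efg" as a whole word inside longer text

const checks = {
  partialInWord: partial.test("efgh"),                // true: substring hit
  anchoredInWord: anchored.test("efgh"),              // false
  anchoredExact: anchored.test("efg"),                // true
  boundedInText: wordBounded.test("abcd efgh ijkl"),  // false: "efgh" is a different word
  boundedExact: wordBounded.test("abcd efg ijkl"),    // true
};

console.log(checks);
```

      Note that ^ and $ anchor the whole string, so they only work when each word is tested on its own. When the filter runs against a field containing several words, \b (or an equivalent tokenizing step) would be the form that enforces full-word matches.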

            Assignee:
            backlog-query-integration [DO NOT USE] Backlog - Query Integration
            Reporter:
            henrik.ingo@mongodb.com Henrik Ingo (Inactive)
            Votes:
            1
            Watchers:
            8

              Created:
              Updated: