Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3.2
    • Component/s: Indexing, Querying
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible

      Description

      Simple text indexing.
      The initial version will be marked as experimental, so it has to be turned on with a command-line flag or at runtime.

      db.adminCommand( { setParameter : "*", textSearchEnabled : true } );

      OR

      ./mongod --setParameter textSearchEnabled=true

      Only works via "text" command currently.

      Features:

      • parsing + stemming for Latin languages
      • multi-word scoring
      • phrase matching
      • word negation
      • weights per field
      • additional suffix index fields for coverings
      • additional prefix index fields for partitioning
      • specify language per document

      Simple Example:

      > db.foo.insert( { _id: 1 , desc: "the dog is running" } )
      > db.foo.insert( { _id: 2 , desc: "the cat is walking" } )
      > db.foo.ensureIndex( { "desc": "text" } )
      > db.foo.runCommand( "text", { search : "walk" } )
      {
      	"queryDebugString" : "walk||||||",
      	"language" : "english",
      	"results" : [
      		{
      			"score" : 0.75,
      			"obj" : {
      				"_id" : 2,
      				"desc" : "the cat is walking"
      			}
      		}
      	],
      	"stats" : {
      		"nscanned" : 1,
      		"nscannedObjects" : 0,
      		"n" : 1,
      		"timeMicros" : 330
      	},
      	"ok" : 1
      }
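
For intuition, here is a toy model (plain JavaScript, NOT the server's actual scoring implementation) of how stop-word removal, stemming, and scoring could combine to produce a score like the 0.75 in the example above. The formula and the naive suffix-stripping stemmer are illustrative assumptions only.

```javascript
// Tiny sample stop-word list (real engines use a per-language list)
const STOP_WORDS = new Set(["the", "is", "a", "an", "and"]);

// Naive stemmer: strip a common English suffix (real engines use Snowball)
function stem(word) {
  return word.replace(/(ing|ed|s)$/, "");
}

// Terms that would be stored in the index for a field's text
function indexTerms(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => !STOP_WORDS.has(w))
    .map(stem);
}

// Toy score: 0.5 base + 0.5 weighted by the matched term's share of the
// field's unique terms. "the cat is walking" indexes ["cat", "walk"], so a
// search for "walk" scores 0.5 + 0.5 * (1 / 2) = 0.75.
function score(docText, searchTerm) {
  const terms = indexTerms(docText);
  const count = terms.filter((t) => t === stem(searchTerm)).length;
  if (count === 0) return 0;
  return 0.5 + 0.5 * (count / new Set(terms).size);
}

console.log(score("the cat is walking", "walk")); // → 0.75
console.log(score("the dog is running", "walk")); // → 0
```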

        Issue Links

          Activity

          jjh Jeremy Hinegardner added a comment -

          I brought up Xapian (http://xapian.org) on the mongodb-users list as a possible library for use:

          http://xapian.org/features

          • Written in C++, so would probably work well with MongoDB
          • Has keyword searching
          • Phrase searching
          • boolean logic
          • stemming

          As for 'real-time', I would suspect it's as real-time as any other full-text search library.

          I have no opinion on whether this should be used or not, it just sounded like a possible good match with mongodb.

          electic Raj Kadam added a comment -

          I guess what I meant by realtime is that the full-text engine needs to allow "incremental updates" to the full-text index. If it has to re-index everything over and over again, like Sphinx, that is really taxing on the CPU and increases latency drastically in high-load environments.

          alanw Alan Wright added a comment -

          I would put forward that a better candidate for full-text search would be CLucene (http://clucene.sourceforge.net/) - a C++ port of the popular Java Lucene engine.

          Storing the index inside MongoDB would be a simple case of overriding the lucene::store::Directory class to point to MongoDB instead of the file-system.

          richard Richard Boulton added a comment -

          Hi - Xapian developer here, and Mongo DB enthusiast (though I've not had an excuse to play with it in anger yet). I'd like to help make a tight integration between Xapian and MongoDB, if there's interest in it. I'm not quite sure what the best approach for linking would be, though.

          Xapian certainly supports "realtime" updates in the sense described above. It also has some features in trunk for supporting replication of the index, which might be helpful when working with MongoDB.

          One basic approach would be to hook into the updates in Mongo somehow, and send them across to a parallel Xapian index for full-text indexing. I think this would be best done by defining some kind of schema, though: often, when searching, you want to search across a fairly complex set of fields (a common example is to search across both title fields and content fields, but to boost the importance of the title fields - but in real world search situations, you often come up with much more complex requirements). A naive mapping of a particular field in Mongo to a search index would allow basic search, but we can do much better than that, I think.

          There are also things like thesaurus entries and spelling correction, which you would want to be able to configure somehow.

          Mongo doesn't really have schemas yet, IIRC, so I'm not sure how the Mongo developers would feel about adding that sort of context.

          When defining searches, Xapian has a built-in and flexible query parser (which is aimed at parsing queries entered into a search box by the average untrained user, so supports some structure (eg, field:value), but copes with any random input in a sane way). It can also have structured searches built up, and combined with the output of parsing several user inputs, so a mapping from a Mongo-style query to a Xapian search could be defined to limit Mongo results.

          Xapian also has things called "External Posting Sources" which are arbitrary C++ classes (subclassing a Xapian::PostingSource class), which can be used to perform combined searches across data stored in the Xapian index, and external data. (A "posting source" in search engine terminology is a list of documents matching a particular word (or term) and is the fundamental piece of data stored in a search engine index.) This could be used to limit searches to documents matching a MongoDB query pretty efficiently, without having to store extra data in Xapian.
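
The posting-source idea Richard describes can be sketched conceptually in plain JavaScript (not Xapian's actual C++ API): the search walks a term's posting list while an external predicate, standing in for a MongoDB query, filters candidates lazily, so no extra data has to live in the search index.

```javascript
// Walk a posting list (doc ids matching a term) and keep only the ids that
// an external predicate accepts. The predicate consults data stored outside
// the search index, mirroring the Xapian::PostingSource idea.
function searchWithExternalSource(postings, externalMatch) {
  const results = [];
  for (const docId of postings) {
    if (externalMatch(docId)) results.push(docId); // external data consulted lazily
  }
  return results;
}

// Hypothetical external store standing in for MongoDB documents
const docs = new Map([
  [1, { lang: "en" }],
  [2, { lang: "de" }],
  [3, { lang: "en" }],
]);

// Posting list for some term, filtered by a "lang == en" external query
console.log(searchWithExternalSource([1, 2, 3], (id) => docs.get(id).lang === "en"));
// → [ 1, 3 ]
```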

          exterpassiv David Lehmann added a comment -

          If Mongo gets integrated full-text search (FTS), then it should be as light and concise as the existing index functionality. This means that I'm strongly opposed to any kind of schema unless it is as small and simple as the ones used for indexes and unique indexes. If we design it with the regular indexes in mind we can assume two things: it is only useful on text fields, and the ordering of the index is less important. The "order" parameter in the index creation/ensuring call could be used for a weight value if we want to recycle the regular index-creation mechanics. We could use an option "fulltext" as we do right now for "unique". The performance tradeoffs when using FTS should be documented as they already are for indexes and unique indexes. The inclusion of documents in the index should be triggered by an existence and type check on the index fields of the document being written/updated.

          After looking at Sphinx http://www.sphinxsearch.com/docs/current.html, Xapian http://xapian.org/ and CLucene http://sourceforge.net/projects/clucene/ it seems that CLucene is the most flexible. Correct me if I'm wrong, but neither Sphinx nor Xapian supports custom persistency implementations. Even if we could store the index files of those two engines in GridFS, this should not be the way to go. Mongo is an extremely fast database and progresses at light speed when it comes to features for easy replication and usage in cluster architectures. Any other persistency mechanism used for data that could also be stored in Mongo just increases the complexity of the setup and brings new and unnecessary problems for both the developers and users of Mongo. Using Mongo as the persistency layer would ensure the availability of its features for clustering (sharding and map/reduce).

          When it comes to the feature set, stemming, keyword search and phrase search are absolutely necessary. The querying mechanics should not include any parsing of fancy search input IMHO but use the Mongo $opcodes. Query languages should be a separate concern and can be integrated in the language-specific drivers via the manipulator mechanics. A simple query parser could be shipped with the Mongo sources, and driver providers could use it via the language-specific foreign function interface or wrapper builders like SWIG http://www.swig.org/.

          It would be nice if we isolate the parts of FTS that are necessary to build a common infrastructure for FTS integration into Mongo. As mentioned by Alan Wright, CLucene has a clean separation of FTS and persistency and therefore could be a good starting point for our efforts. A common infrastructure would help everybody interested in integrating his or her preferred full text search engine.

          eliot Eliot Horowitz added a comment -

           I think the right way to do this is the following:

           • make the db take:

             db.foo.ensureIndex( { title : 1 } , { fullTextSearch : true } )
             db.foo.ensureIndex( { title : 1 , test : .2 } , { fullTextSearch : true } )

           • modify the indexing code to use the CLucene (or whichever) engine for tokenizing and stemming, and put those terms in the index
           • the query would be something like:

             db.foo.find( { $ft : "cool stuff" } )
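
The per-field weights in Eliot's second ensureIndex call ({ title : 1 , test : .2 }) could feed an inverted index along these lines. This is an illustrative plain-JavaScript sketch under the assumption that each posting carries the weight of the field the term came from; the fullTextSearch option itself is a proposal, not a shipped API.

```javascript
// Build an inverted index: term -> [{ id, weight }], where weight comes from
// the field the term was found in (e.g. title hits weigh 1, test hits 0.2).
function buildInvertedIndex(docs, fieldWeights) {
  const index = new Map();
  for (const doc of docs) {
    for (const [field, weight] of Object.entries(fieldWeights)) {
      const text = doc[field];
      if (typeof text !== "string") continue; // skip absent/non-text fields
      for (const term of text.toLowerCase().split(/\s+/)) {
        if (!index.has(term)) index.set(term, []);
        index.get(term).push({ id: doc._id, weight });
      }
    }
  }
  return index;
}

const idx = buildInvertedIndex(
  [
    { _id: 1, title: "cool stuff", test: "boring stuff" },
    { _id: 2, title: "other things", test: "cool things" },
  ],
  { title: 1, test: 0.2 }
);

// "cool" is a title hit (weight 1) in doc 1 and a test hit (weight 0.2) in doc 2
console.log(idx.get("cool")); // → [ { id: 1, weight: 1 }, { id: 2, weight: 0.2 } ]
```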

          alanw Alan Wright added a comment -

          Eliot - that looks good.

          Adding full text searching to MongoDB and making it as easy to use as you've described would be fantastic!

          exterpassiv David Lehmann added a comment - - edited

          @eliot yep, that's what i wanted to say

          what do you think about the idea of implementing the query parser as a client side manipulator?

          eliot Eliot Horowitz added a comment -

          @david Not sure what you mean by that?

          exterpassiv David Lehmann added a comment -

          @eliot maybe i got it wrong.

          {$ft: "cool stuff"} means phrase-, key-search or query language?

          eliot Eliot Horowitz added a comment -

          keyword search
          so "cool stuff" would look up matches for cool, and stuff, then take the intersection, do scoring, etc...
          it could eventually grow into a query language, but probably not initially

          exterpassiv David Lehmann added a comment -

          sorry, mixed up manipulator with modifiers ^^ corrected in my previous comment

          i don't see how a simple {$ft: "something keyword like"} could help me find docs if i have more than one ft index.

          a keyword search could/should be something like the $or query in SERVER-205

          {$ft: {title: ["keyword", "search"]}}

          alanw Alan Wright added a comment -

          Wouldn't the query follow the Lucene query syntax? (http://lucene.apache.org/java/2_3_2/queryparsersyntax.html)

          It might also be useful to search specified fields (also described in the Lucene syntax)...

          db.foo.find( { $ft : "title:cool AND title:stuff" } )

          db.foo.find( { $ft : "title:cool OR test:stuff" } )

          At least in the first version?

          In future versions, indexing could be expanded to perhaps allow specifying language analysers (useful when stemming)...

          db.foo.ensureIndex( { title : 1 } , { fullTextSearch : true } , { textAnalyzer : standard } )
          db.foo.ensureIndex( { title : 1 } , { fullTextSearch : true } , { textAnalyzer : snowball } , { textLanguage : German } )

          electic Raj Kadam added a comment -

          These are all good suggestions, but what would be the latency from the time of an insert to the time the document appears in the FT index? I mean, what is the ideal latency?

          exterpassiv David Lehmann added a comment -

          @alan: I guess it would be harder to use Mongo replication/sharding facilities if the queries are not just plain Mongo queries, but I'm not sure about this. The important functionality of the FTS engine is the analysis/transformation of the input data ... could be wrong on this, but querying should be left to Mongo and therefore the "query language" of Mongo should be used. Combined with map/reduce this would be very powerful. The transformation from a high-level query language to regular Mongo queries or map/reduce should be in the application layer, or maybe better in the language-specific driver.

          @raj: If the FTS uses Mongo indexes, the penalty is paid while inserting/deleting a doc. The index is up-to-date after the successful insert/delete. It's the same with the standard indexes. Full-text indexes for collections that have high insert/delete rates are even more counterproductive than regular indexes because of the nature of natural-language analysis algorithms. This should get even worse if word distance is part of the feature set.

          Maybe I'm overcomplicating this, and a simple OR-based prefix-, infix- and postfix-keyword search with Snowball stemming for the easily stemmable languages would be fine for 99% of Mongo's users. Will ask in #mongodb if I find time.

          eliot Eliot Horowitz added a comment -

          I think distance, etc. can be done on the query side rather than the indexing side.
          a search for "cool stuff":

          1) find "cool"
          2) find "stuff"
          3) find the intersection
          4) score that subset

          if it's done on the query side it can use multiple cores, etc...
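
The four steps above can be sketched against a toy inverted index (illustrative plain JavaScript; the IDF-flavoured score is an assumption standing in for real scoring, which would also use term frequency, proximity, field weights, etc.):

```javascript
// Hypothetical inverted index: term -> set of doc ids containing it
const INDEX = new Map([
  ["cool", new Set([1, 2])],
  ["stuff", new Set([1, 3])],
  ["other", new Set([3])],
]);

function search(terms) {
  // steps 1+2: find the posting set for each query term
  const postings = terms.map((t) => INDEX.get(t) || new Set());
  // step 3: intersect - keep docs present in every posting set
  const [first, ...rest] = postings;
  const hits = [...first].filter((id) => rest.every((s) => s.has(id)));
  // step 4: score the surviving subset; here rarer terms contribute more
  // (1 / posting-list size), a crude IDF-style stand-in
  return hits.map((id) => ({
    id,
    score: postings.reduce((s, p) => s + 1 / p.size, 0),
  }));
}

console.log(search(["cool", "stuff"])); // → [ { id: 1, score: 1 } ]
```

Because each step works on independent posting lists and then a shared subset, the lookups in steps 1-2 parallelize naturally, which is the multiple-cores point above.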

          exterpassiv David Lehmann added a comment -

          With distance I mean the distance used in proximity searches not weight.

          eliot Eliot Horowitz added a comment -

          @David - right, that's what i meant as well

          exterpassiv David Lehmann added a comment -

          @eliot good to know. Which data would you put in the index, and how would you do the scoring?

          eliot Eliot Horowitz added a comment -

          the index would just be the words.
          scoring would happen afterwards and can be based on proximity, etc...

          davoodoo Sebastian Friedel added a comment -

          hello everyone
          ok, so if you want to have infix search with a minimal infix length of 2 you'll have an index as follows:
          field: ['co', 'oo', 'ol', 'coo', 'ool', 'cool', 'st', 'tu', 'uf', 'ff', 'stu', 'tuf', 'uff', 'stuf', 'tuff', 'stuff']
          how would you extrapolate word proximity from that?
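
Such an infix index could be generated like this (an illustrative sketch, shorter substrings first as in the example above); note that each word produces O(n²) entries, which hints at the cost of the approach:

```javascript
// Emit every substring of `word` from minLen up to the full length,
// ordered by length and then by position, deduplicating repeats.
function infixes(word, minLen) {
  const out = [];
  for (let len = minLen; len <= word.length; len++) {
    for (let start = 0; start + len <= word.length; start++) {
      const sub = word.slice(start, start + len);
      if (!out.includes(sub)) out.push(sub); // dedupe repeats like "oo" in "look"
    }
  }
  return out;
}

console.log(infixes("cool", 2));
// → [ 'co', 'oo', 'ol', 'coo', 'ool', 'cool' ]
console.log(infixes("stuff", 2));
// → [ 'st', 'tu', 'uf', 'ff', 'stu', 'tuf', 'uff', 'stuf', 'tuff', 'stuff' ]
```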

          eliot Eliot Horowitz added a comment -

          i don't think version 1 would have substring matching.
          more like a regular search engine.

          i think that would be a different mode, since it's a lot more costly and not as often needed

          davoodoo Sebastian Friedel added a comment -

          I don't think it's so seldom needed. Think of things like 'waterbed' or 'waterdrop' ... as a user I would find it very irritating if I couldn't find any of those when I search for 'water'.
          And there are languages where compound words are much more common than in English.
          I could give more examples, but I think the above already states my point.

          alanw Alan Wright added a comment -

          @Sebastian - I would take a look at the Lucene Wildcards (http://lucene.apache.org/java/2_3_2/queryparsersyntax.html#Wildcard%20Searches). This would provide the capability you need (eg. searching for "water*" would find "waterdrop" and "waterbed")
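
A trailing-wildcard query like "water*" can be answered from an ordinary sorted term index with a simple prefix scan; here is a minimal plain-JavaScript sketch (not Lucene's implementation). Note that a leading wildcard ("*water", which would match "seawater") cannot use this scan, which is part of why true infix search is costlier.

```javascript
// Match a term pattern against a sorted list of index terms.
// "water*" matches any term starting with "water"; a bare pattern
// matches exactly.
function wildcardMatch(sortedTerms, pattern) {
  if (!pattern.endsWith("*")) return sortedTerms.filter((t) => t === pattern);
  const prefix = pattern.slice(0, -1);
  return sortedTerms.filter((t) => t.startsWith(prefix)); // prefix range scan
}

const terms = ["seawater", "water", "waterbed", "waterdrop"];
console.log(wildcardMatch(terms, "water*"));
// → [ 'water', 'waterbed', 'waterdrop' ]  ("seawater" is NOT matched)
```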

          eliot Eliot Horowitz added a comment -

          A lot of it comes down to covering all the basics while keeping it simple and fast.
          Since this will be real-time, we need to be very speed-aware and really just make sure it covers most use cases.
          While the "water" example is good, I still think it's less common and adds almost an order of magnitude more complexity.
          Not saying it can't happen, just probably not for version 1.

          Also, while we work on basic full-text search inside Mongo, it still might make sense to have adapters to an outside, non-realtime search engine that could be more complex.

          davoodoo Sebastian Friedel added a comment -

          @Alan: having integrated search engines several times over the last few years, I know that every serious search implements at least prefix or infix search alongside stemming
          @Eliot: ok, so you would like the FTS in MongoDB to be just basic and simple and let the more 'fancy' use cases be handled by specialized software such as Sphinx/Lucene/Xapian?

          eliot Eliot Horowitz added a comment -

          Maybe.

          It could even be CLucene with a Mongo backend.
          The key is that what's "builtin" needs to be real-time and very efficient.
          So we want to do the minimal work that solves a lot of real-world problems.

          exterpassiv David Lehmann added a comment -

          @eliot: That's why I asked what we can put into the indexes and what could be done in the client implementations/manipulators.

          I'm sure that it is possible to have real full-text search in Mongo with a query language that is a superset (doh, a connected word again) of standard queries. We have to find out which parts have to stay in the clients or application layer, and what we want to do in the write/delete ops of the DB. The main "problem" is Mongo indexes - but only if we want to use them for storage of the entire search indexes. This gets clear if we take a closer look at proximity search, but it doesn't look like a long-term blocker to me.

          A first step would be to do exactly what you propose when you talk about Lucene with a Mongo backend - it could be any engine with a clean separation of NLA and persistency. The simple search you described earlier would be the "ASCII of search". This would not deliver the wanted results in languages like French, German ... you name it, we have it. IMHO everyone who wants non-English FTS in Mongo just would not use the simple one because it wouldn't cover their needs.

          exterpassiv David Lehmann added a comment -

          I forgot to say that I'm not talking about what 10gen should do as a first step. Sebastian and I have plans to implement FTS anyway and just want to make sure that we don't cover only our own needs.

          nicolas_ Nicolas Fouché added a comment -

          I've developed my own search feature on MongoDB, with $ft keys. If you store documents in all languages, stemming is too much pain (language detection, then applying the right stemmer if one can be applied - and sometimes language detection per paragraph/sentence is needed).

          Like Eliot says, if MongoDB embeds a full-text search feature, it should be as minimal as possible. Extract words, convert them to ASCII, remove 1-2 character words, and put them in an array key. The more MongoDB does, the more choices it makes, and the fewer use cases it will match.
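
A minimal sketch of that indexing pass, assuming client-side Python (the `_ft` key name and the exact drop-short-words rule are illustrative, not from this thread):

```python
import re
import unicodedata

def tokenize(text):
    """Minimal indexing pass: fold to ASCII, lowercase, split on
    non-letters, and drop 1-2 character words."""
    # Fold accented characters to their ASCII base (e.g. 'é' -> 'e')
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    words = re.findall(r"[a-z]+", folded.lower())
    return sorted({w for w in words if len(w) > 2})

# The token array is stored alongside the document (here in a
# hypothetical `_ft` key); a plain equality query then matches
# directly against the array elements.
doc = {"desc": "Le café est très chaud"}
doc["_ft"] = tokenize(doc["desc"])
print(doc["_ft"])  # ['cafe', 'chaud', 'est', 'tres']
```

Because MongoDB matches array elements with plain equality, no server-side support is needed beyond a normal index on the array key.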

          Stemming can be (kind of) replaced by expanding the query: http://en.wikipedia.org/wiki/Query_expansion . For that I'm still waiting for the $or feature. Query expansion adds more processing at runtime, but makes the database a lot more flexible: no need to migrate data each time you enhance your algorithms, for example.

          On top of that you build a query grammar (with Treetop for example: http://treetop.rubyforge.org/), create multiple keys for full text, and you have a search engine that supports metadata queries, e.g. "waterdrop subject:house date:>20090101". And as soon as the $or feature is ready, users could add OR keywords to their queries (to satisfy one or two of them).
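
As a rough illustration of such a grammar (a hand-rolled regex parser here instead of Treetop; the field names and the Mongo-style output shape are hypothetical):

```python
import re

def parse_query(q):
    """Split a query like 'waterdrop subject:house date:>20090101' into
    free-text terms plus a Mongo-style filter document."""
    terms, filters = [], {}
    for part in q.split():
        m = re.match(r"^(\w+):([<>])?(.+)$", part)
        if m:
            field, op, value = m.groups()
            if op:  # range operator -> $gt / $lt
                filters[field] = {{">": "$gt", "<": "$lt"}[op]: value}
            else:   # plain metadata equality
                filters[field] = value
        else:       # anything without a colon is a full-text term
            terms.append(part)
    return terms, filters

terms, filters = parse_query("waterdrop subject:house date:>20090101")
print(terms)    # ['waterdrop']
print(filters)  # {'subject': 'house', 'date': {'$gt': '20090101'}}
```

The free-text terms would go through the tokenizer, while the filter document is merged into the normal query.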

          If anyone is interested, I can write blog articles to describe this solution in depth.

          Of course you don't have wildcards, phrase search, fuzzy search, nearby search or scoring. But I suppose that if you need those, then you definitely don't target average users. Take a look at the search feature in Facebook messages powered by Cassandra: it's horrible (non-English people will understand), it does not even find the exact words you type in your query... but it's blazing fast and no one complains. It seems that Twitter added phrase search recently; Digg did not, and neither did Yammer. As a former Lucene user, I thought I needed all these features, but I discovered that none of our users asked for them, and I don't actually need them to find what I'm looking for. In startups we don't want to spend 80% of the time satisfying 1% of our users.

          For developers needing a featureful search, why not consider an "external hook" à la CouchDB?

          eliot Eliot Horowitz added a comment -

          No licensing issues I believe - certainly not in building yourself.

          The external hook is very easy to get started with. Very simple to tap into mongo's replication log and use that for updating indexes.
          I'd love to work with anyone on that.

          Embedding is more complicated. We just built a module system that allows external modules to be linked in.
          We're going to be adding more hooks into that, so it would be possible to have a c++ trigger basically.
          That would be a good 2nd level integration.

          The 3rd level would be having it totally built in. That's a ways off, and I'm not sure it's something that is needed anyway.

          eliot Eliot Horowitz added a comment -

          Most of what you need should be here: http://www.mongodb.org/display/DOCS/Replication+Internals
          The transaction log is a capped collection, so it's just a matter of reading from that.

          gf gf added a comment -

          Sphinx rules. The "incremental updates" feature is coming soon.
          Full re-indexing is possible using "xmlpipe2" data-source.
          Live updates will be possible using replication (like MongoNode).

          electic Raj Kadam added a comment -

          If Sphinx does incremental updates, then yes, I believe it is at the top of the pack.

          eliot Eliot Horowitz added a comment -

          Trying to keep 1.5/1.6 very focused on sharding + replica sets.
          Will try to get it out ASAP so we can go back to features like these.

          djames David James added a comment -

          @Nicolas Fouché: just saw your comment about using treetop to create a query language, so I wanted to share a little code I put together: http://github.com/djsun/query_string_filter which converts a 'filter' param in a query string to a hash suitable for MongoMapper.

          rogerbinns Roger Binns added a comment -

          Another requirement for me not mentioned here is providing information for autocompleting fields in a web interface.

          gf gf added a comment -

          http://sphinxsearch.com/
          Sphinx 1.10-beta is out including real-time indexes.

          howthebodyworks dan added a comment -

          My company has written a Mongo-native full-text search. Currently supports English - although stemmer commits are welcome. There is also a Python library, which has substantial extra functionality because of restrictions on server-side JavaScript execution. Indexing happens via mapreduce for maximal concurrency. The v8 build is recommended for speed - our trials report about a 4x speed increase.

          http://github.com/glamkit/mongo-search

          apgdb Andrew G added a comment -

          In case anyone is interested, I have written a prototype desktop application for which I need a database with a text index/search facility (with incremental updates). It runs on Kubuntu if the deb package python-storm is installed. It should run under Windows if you install enough things (Python 2.6, Qt, PyQt, Canonical Storm).

          http://kde-apps.org/content/show.php/Knowledge?content=111504

          article here

          http://dot.kde.org/2010/06/29/knowledge-different-approach-database-desktop

          robfromboulder Rob added a comment -

          I'd also like to vote for these:

          • Search by keywords
          • Real-time
          • Allow for phrase searching

          I'm all for keeping MongoDB simple as others have stated. I agree that stemming/wildcards could be deferred to advanced cases.

          But if you leave out phrase searching, there's no advantage over just breaking words into an array, is there?

          howthebodyworks dan added a comment -

          You can add phrasal searching to the library we have produced using simple JavaScript - patches welcome. We've been keen to keep the library fast and simple so far, and haven't added phrasal search as such because a well-ranked stemming search has done the job for us very well.

          As for real-time-ness... I guess that would require C++-level support, unless you were keen to implement the indexing function in a client library. Mapreduce is currently the only option for non-blocking server-side JS execution.

          What do you mean by "keyword search" specifically?

          mikejs Michael Stephens added a comment -

          I've started work on a little tool (http://github.com/mikejs/photovoltaic) that uses mongo's replication internals to automatically keep a solr index up to date. It's rough around the edges but my FTS needs aren't very fancy.

          klondike Eric Mill added a comment -

          Is this still targeted for release in 1.7? I'm gauging whether I should hold out, or go try to integrate with Solr.

          eliot Eliot Horowitz added a comment -

          It's being considered for 1.7.x.
          Not committed yet - though we'd like to.

          howthebodyworks dan added a comment -

          @eric - So you didn't like our native Mongo search? Is phrasal searching the only show-stopper?
          @eliot - I'm hoping you folks will use the work we did on our FTS for Mongo itself. It has test suites and such. Of course, if you are going to roll a new one in raw C++ that will change things.

          eliot Eliot Horowitz added a comment -

          @dan - whatever we do will definitely be embedded in the db, written in C++.

          klondike Eric Mill added a comment -

          @dan - I'd consider using your library if there were documentation - I have little experience with MongoDB plugins and I'm not sure how to use your code. Phrasal search is important, but I would be fine using a solution that lacked it as a stopgap.

          @eliot - I'll be crossing my fingers, then.

          wonderman huangzhijian added a comment -

          @Eliot Horowitz I am really excitedly anticipating the full-text search functionality; it would be fantastic if it could support the Chinese language.

          eliot Eliot Horowitz added a comment -

          Does anyone watching this case have a dataset and tests with another system (MySQL, Xapian, etc.)?
          That would be very helpful.

          jbergstroem Johan Bergström added a comment -

          Eliot: I have a system moving from Postgres to Mongo (transition period) that uses Xapian for full-text search. What kind of input do you seek? (Sorry, this isn't OSS.)

          eliot Eliot Horowitz added a comment -

          Ideally a data set and some test cases. Don't need any code.

          shadowman131 Walt Woods added a comment -

          @Eliot Horowitz - Question about your above-mentioned ensureIndex( { title: 1, body: 0.2 }, { fullTextSearch: true } ) API example; are these per-word weights or per-document weights?

          eliot Eliot Horowitz added a comment -

          @walt - the idea is per word, if I understand what you mean.

          shadowman131 Walt Woods added a comment -

          @Eliot Horowitz - Ah... Do you think it would be possible to provide per-document weights? Titles are almost always short, but bodies are much more variable; in my opinion, it's not very fair to count a longer document as more pertinent to the subject requested.

          eliot Eliot Horowitz added a comment -

          Can you describe exactly what you mean by per word and per document?

          shadowman131 Walt Woods added a comment -

          Yeah; per-word weighting: count occurrences of the word, multiply occurrences by the weight to get that word's total weight in the document, and store it in the index.

          Per-document weighting: count occurrences of a single word, divide by the total occurrence count for the indexed field (e.g. # of words), multiply this fraction by the specified index weight, and store it in the index.

          Essentially, this uses weights to bound the effect of any single field in the full-text index with respect to the total document score. Maybe this is obvious and how it was going to be done anyway? Admittedly, I don't have much experience with other full-text indexers... just what I've done experimenting with my own variation for a while now.
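
A toy numeric illustration of the two schemes (the field names and weights mirror the ensureIndex example above; the scoring functions are illustrative, not MongoDB's actual implementation):

```python
from collections import Counter

def per_word_score(text, term, weight):
    """Per-word weighting: raw occurrence count times the field weight."""
    return Counter(text.lower().split())[term] * weight

def per_document_score(text, term, weight):
    """Per-document weighting: occurrences divided by the field's total
    word count, so longer fields don't dominate, then scaled."""
    words = text.lower().split()
    return (Counter(words)[term] / len(words)) * weight

title = "dog training"                   # short field, weight 1.0
body = "the dog ran and the dog barked"  # longer field, weight 0.2

# Raw counts favor the longer body (2 hits vs 1):
print(per_word_score(title, "dog", 1.0))      # 1.0
print(per_word_score(body, "dog", 0.2))       # 0.4
# Length normalization keeps the short title competitive:
print(per_document_score(title, "dog", 1.0))  # 0.5
print(per_document_score(body, "dog", 0.2))   # ~0.057
```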

          rogerbinns Roger Binns added a comment -

          @Walt: There are already scoring algorithms developed and tuned over the years. For example see BM25:

          http://en.wikipedia.org/wiki/Okapi_BM25

          BM25F would take into account multiple fields for a document (eg a title in addition to a body).

          There is an open-source pure-Python text search library, Whoosh, that implements this scoring algorithm, hence providing some nice reference code. I believe it is also part of Xapian etc.

          http://bitbucket.org/mchaput/whoosh/wiki/Home
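
For reference, a minimal single-term Okapi BM25 scorer following the formula on the linked Wikipedia page (a sketch for illustration, not code from any of the systems discussed here):

```python
import math
from collections import Counter

def bm25_scores(docs, term, k1=1.2, b=0.75):
    """Okapi BM25 for a single query term over an in-memory corpus.
    `docs` is a list of token lists; returns one score per document."""
    n = len(docs)
    df = sum(1 for d in docs if term in d)           # document frequency
    idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
    avgdl = sum(len(d) for d in docs) / n            # average doc length
    scores = []
    for d in docs:
        tf = Counter(d)[term]                        # term frequency
        denom = tf + k1 * (1 - b + b * len(d) / avgdl)
        scores.append(idf * tf * (k1 + 1) / denom)
    return scores

docs = [
    "the dog is running".split(),
    "the cat is walking".split(),
    "dog dog dog dog dog dog".split(),
]
scores = bm25_scores(docs, "dog")
# Term repetition raises the score sublinearly; documents without the
# term score 0, and longer documents are penalized via avgdl.
```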

          eliot Eliot Horowitz added a comment -

          We have a proof of concept working in the lab - but we want to make sure it's rock solid before releasing.

          dlee David Lee added a comment -

          That's great! Do you plan on supporting stemming and phrase matching?

          eliot Eliot Horowitz added a comment -

          Stemming for sure - phrase matching may or may not be in version 1.

          gf gf added a comment -

          Eliot Horowitz: why don't you want to use Sphinx? Implementation is easy, for sure. It's feature-rich and robust.

          mitchitized Mitch Pirtle added a comment -

          Sphinx doesn't have partial updates to indexes IIRC. You'd have to regenerate all indexes from scratch for every update.

          I've been looking into Elastic Search as a workaround for now, but other priorities keep me from getting it done.

          gf gf added a comment -

          Mitch Pirtle: Check the Sphinx site. "Jul 19, 2010. Sphinx 1.10-beta is out: We're happy to release Sphinx 1.10-beta, with a new major version number that means, as promised, real-time indexes support."

          dlee David Lee added a comment -

          gf: The Sphinx real-time indexing is only available for SphinxQL, which is not a good option for MongoDB.

          mlen Matt L added a comment -

          ...and Sphinx 1.10 (with RT) is only available as a beta (so far), and has a lot of bugs (see their bug tracker/forum). I tried to use it in my project, but gave up.

          eliot Eliot Horowitz added a comment -

          A good Sphinx adapter would be good as well, but we want something embedded with no external dependencies, so for most cases you don't need any other systems.

          djfobbz Rick Sandhu added a comment -

          +1 for embedded full text search functionality

          djfobbz Rick Sandhu added a comment - - edited

          @Eliot are you still looking for sample datasets with test cases? How large a dataset would you like?

          plasma Andrew Armstrong added a comment -

          Unfortunately it appears as though Wikipedia's data dumps are offline at the moment due to server trouble (http://download.wikimedia.org/), but it's about 30GB uncompressed to grab all of the English wiki pages (no history etc.), which would probably be a neat data set to test against.

          See http://en.wikipedia.org/wiki/Wikipedia:Database_download for more info, perhaps an older copy is available as a torrent/mirrored somewhere you could use.

          rogerbinns Roger Binns added a comment -

          After doing some work using MongoDB stored content and using Solr (server wrapper over Lucene) as the FTS engine, these are parts of my experience that mattered the most:

          You need a list of tokenizers that run over the source data and queries. E.g. you need to look at quoting, embedded punctuation, capitalization etc. to handle things like: can't, I.B.M., big:deal, will.i.am, i.cant.believe.this, up/down, PowerShot etc.

          A list of filters that work on the tokens - some would replace tokens (eg you replace tokens with upper case to be entirely lower case) while others add (eg you add double metaphone representations, stemmed). Another example is a filter that replaces "4" with "(4 OR four)". If you can apply a type to the tokens then that is great so smith becomes "(smith OR DM:SMT)" and running becomes "(running OR STEM:run)". This lets you match double metaphone against double metaphone, stems against stems without "polluting" the original words.

          Some way of boosting certain results. For example if using a music corpus then searches for "michael" should have higher matches for "Michael Jackson" than "Michael Johnson". Note that this is not the same as sorting the results since other tokens in the query also affect scoring. eg "Michael Rooty" needs to pick Michael Johnson's "Rooty Toot Toot for the Moon" and not any MJ song despite all MJ songs having a higher boost.

          Multi-valued fields need to be handled correctly. Solr/Lucene treat them as concatenated together. For example a field name: [ "Elvis Presley", "The King"] is treated as though it was "Elvis Presley The King". If your query is for "King" then they treat that as matching one of the four tokens whereas it matches one out of the three tokens of "King of Siam". The BM25 style weighting scheme (or something substantially similar) everyone uses takes the document length into account which causes problems with the concatenation. Of course there will some documents where the multi-values should be concatenated and others where they are alternatives to each other as in my example.

          You need pagination of results which usually means an implementation that caches previous queries to quickly return later parts.

          Query debugging is important because you'll end up with unanticipated matches/scores and want to know why. In Solr you add a query parameter (debugQuery=true) and it returns that information. You can see what the tokenization and filtering did. You can then also see for each result how the final score was computed (a tree structure).
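
A sketch of the tokenizer/filter chain described above (the toy suffix stemmer and the "STEM:" token type are illustrative only, standing in for a real stemmer or double-metaphone filter):

```python
def tokenize(text):
    """Tokenizer: split on whitespace and strip surrounding punctuation,
    keeping internal apostrophes and dots so "can't" and "will.i.am"
    survive as single tokens."""
    punct = ".,;:!?\"'()"
    return [t.strip(punct) for t in text.split() if t.strip(punct)]

def lowercase_filter(tokens):
    """A replacing filter: rewrites every token."""
    return [t.lower() for t in tokens]

def stem_filter(tokens):
    """An additive filter: emits a typed stem alongside the original
    token, so stems only ever match other stems (toy suffix stripper)."""
    out = []
    for t in tokens:
        out.append(t)
        for suffix in ("ning", "ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                out.append("STEM:" + t[: -len(suffix)])
                break
    return out

def analyze(text, filters=(lowercase_filter, stem_filter)):
    """Run the tokenizer, then each filter in order, Solr/Lucene style."""
    tokens = tokenize(text)
    for f in filters:
        tokens = f(tokens)
    return tokens

print(analyze("The dog is Running!"))
# ['the', 'dog', 'is', 'running', 'STEM:run']
```

The same chain runs over both documents and queries, which is what keeps stems matching stems without polluting the original words.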

          eliot Eliot Horowitz added a comment -

          We have some more ideas here.
          Seems likely to be in 2.2

          tshawkins Tim Hawkins added a comment - - edited

          Check out ElasticSearch: Lucene-based, but fully UTF-8, REST- and JSON-based. In most cases you can just pull a record from mongo and push it straight to ES, just removing the "_id" field. ES then generates its own _id field (actually also called "_id"), which can be the contents of your original mongo record id as a string (you just need to put the MongoId as the identity on the PUT call that adds the record to the database). Supports embedded documents, and uses the same dot notation for specifying embedded doc members.
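          A sketch of that record hand-off in Python (the host, index and type names are illustrative, and no HTTP client is assumed - this just prepares the PUT URL and body):

```python
def to_es_put(doc, base="http://localhost:9200", index="mydb", doc_type="mycoll"):
    # copy the mongo record, drop "_id" from the body, and reuse its
    # string form as the identity on the ES PUT URL
    body = dict(doc)
    es_id = str(body.pop("_id"))
    url = "%s/%s/%s/%s" % (base, index, doc_type, es_id)
    return url, body

url, body = to_es_put({"_id": 2, "desc": "the cat is walking"})
# url  -> "http://localhost:9200/mydb/mycoll/2"
# body -> {"desc": "the cat is walking"}
```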

          Supports incremental updates, sharded indices, faceted search.

          Working with ES "feels" just like working with mongo: similar interfaces, similar simplicity.

          http://www.elasticsearch.com/

          I'm working on an oplog observer that will allow ES to track changes to MongoDB collections.

          gerhardb Gerhard Balthasar added a comment -

          Just stumbled over ElasticSearch - seems like the perfect add-in for mongodb. Or what about merging? "ElasticMongo" sounds good to my ears..

          bryangreen Bryan C Green added a comment -

          I am also interested in the solution involving ElasticSearch. I've converted from MySQL to PostgreSQL to ElasticSearch for full-text search. I think I'm going to use ES + mongodb + PostgreSQL (for now). I'd like some more elasticsearch rivers...

          hbf Kaspar Fischer added a comment -

          I am also interested in this. If I understand correctly, a (non-application-level) integration of ElasticSearch (or Solr, or similar technology) would be much easier if triggers were available (https://jira.mongodb.org/browse/SERVER-124).

          walec51 Adam Walczak added a comment -

          Synonym support and most-used-word suggestions would also be nice as part of this feature.

          trbs trbs added a comment -

          I personally don't see how any non-C/C++ system, especially not Java/JVM, could be used as an embedded or internal search system for MongoDB.

          It seems to me that it is not worth the time talking about in this context. For these kinds of integration, developing a good external framework would be the way forward.

          For example, I'm using Xapian (a project which I do consider possibly embeddable by MongoDB) in several of my MongoDB projects, and that works perfectly fine as a coupled system. When doing a highly specialized search, or when MongoDB is just one of the sources, you will often wind up with an externalized search engine anyway.

          For an integrated search, I agree that something good, simple and fitting 90% of the common use cases would be much better than a specialized search.

          There are search features particularly interesting for MongoDB, like search queries across multiple collections, handling embedded searching in a sharded environment, using replica sets to scale searching, handling embedded documents and/or deep document structures, etc.

          rgpublic rgpublic added a comment -

          OK. Don't want to add too much spam, but since everyone seems to be writing about their ideas on this top-voted issue, here goes:

          For our company, what's really great about MongoDB is its innovation along the lines of: create a database that finally does what IMHO it should have done all the years before, i.e. don't add lots of additional work for a developer (fighting with SQL statements, data types, etc), but instead make it as easy as possible to 1) get data in and 2) get data out (by searching for it). Consequently, for a full-text solution to be "Mongo-like", I guess it should most importantly be seamless! The user shouldn't notice there is an external engine, neither during installation nor during daily work. There shouldn't be any difference in use between a normal index and a full-text index. You should be able to create and query them just like any other index. I don't think the solutions proposed here that aim to couple projects like ElasticSearch (especially not projects in a different programming language like Java) would ever be able to meet that criterion properly. Even worse if they are kept in sync via triggers or the like. I might be wrong, but I anticipate lots of problems if MongoDB full-text search worked like this - like the index being out of sync etc. Rather, I would prefer the full-text index to simply work like described here (more features added later, step by step):

          http://api.mongodb.org/wiki/current/Full%20Text%20Search%20in%20Mongo.html

          Only the specific details (like having to create an additional field with the words array) should be "hidden" from the user. If one could create functional indexes (a different issue) then at least one would be able to create such an index on an array of words more easily, without those ugly auxiliary fields.

          hbf Kaspar Fischer added a comment -

          From the issue comments given so far, I read that lots of people want a simple, out-of-the-box full-text search solution. On the other hand, others want more advanced, external search solutions to be integrated. These seem to be two different concerns.

          If I understand correctly, both concerns need support from MongoDB which is not yet available or underway: namely, a (simple or external) search solution needs to be able to learn when to index what, and when to update the index. Maybe such a layer could be added, and afterwards people can come up with different "plugins" that realize out-of-the-box search or integration with external search solutions? I would be happy to work on the latter, but to do so I would very much welcome an API where my search solution can plug in (otherwise I would have to be a MongoDB internals expert).

          felixgao Felix Gao added a comment -

          It has been over 2 years now; what is the current status of this ticket?

          djfobbz Rick Sandhu added a comment -

          Is this ticket dead? Been over 2 years with no resolution... admins, please update! Thanks

          eliot Eliot Horowitz added a comment -

          Not dead - just hasn't been implemented yet.

          dlee David Lee added a comment -

          Is there a design waiting to be implemented? Or is there no accepted design proposal yet?

          It looks like this feature request is in Planning Bucket A. What does the timeline look like for Planning Bucket A?

          eliot Eliot Horowitz added a comment -

          There is a likely design.

          There are just more pressing things to work on at the moment.

          We're hoping this makes it in 2012.

          rubo Rouben Meschian added a comment -

          Having the ability to perform full text search across the mongodb collections would be amazing.
          This is a major feature that is missing from this system.

          skall.paul@gmail.com Sougata Pal added a comment - - edited

          MongoLantern can also be a good option for full-text searching directly with MongoDB, without installing any new software. It does have a few optimization issues for very large databases; hopefully they will be fixed in later versions.

          http://sourceforge.net/projects/mongolantern/

          glenn Glenn Maynard added a comment -

          -1 to integrating an actual FTS engine into Mongo. Mongo should provide generic building blocks for implementing features. Complex, domain-specific features like FTS should be layered on top of it.

          Features like plugins that can receive notification of changes so they can update indexes sound more reasonable.

          turneliusz Tuner added a comment -

          @Glenn Maynard, I don't agree. What about map/reduce queries that use full-text search at the same time?

          rgpublic rgpublic added a comment -

          I don't agree either. Not only map/reduce but basically any search using a combination of an ordinary query and fulltext search would be impossible. In fact, you can already install say ElasticSearch and keep that in sync with MongoDB manually. The problem comes up if you want to run a combined query. You need to learn and use two completely different query languages, download all results (possibly a lot) to the client and recombine them there.

          BTW: if SERVER-153 ever gets implemented before this, we could create an index on 'field.split(" ")' and query that to get a simple full-text search. Just a crazy thought...
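          The words-array workaround mentioned in this thread can be emulated today by having the application maintain an auxiliary array field (the field names here are illustrative), which a multikey index then makes queryable word by word:

```python
def add_words_field(doc, source_field="desc", words_field="_words"):
    # materialize field.split(" ") into an auxiliary array field,
    # which a multikey index on "_words" could then serve
    out = dict(doc)
    out[words_field] = sorted(set(out[source_field].lower().split()))
    return out

add_words_field({"_id": 1, "desc": "the dog is running"})
# -> {"_id": 1, "desc": "the dog is running",
#     "_words": ["dog", "is", "running", "the"]}
```

          Functional indexes would let the database compute "_words" itself, which is exactly the "hidden auxiliary field" the comments ask for.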

          turneliusz Tuner added a comment -

          @rgpublic, I'm actually using this solution and it's PAIN to have a translator of queries to two databases, MongoDB and ElasticSearch. The synchronization is done manually, and ElasticSearch needs MORE fields, for example to sort on RAW data - because ES by default is just a bunch of indexes which can't even be sorted by the original value. So that makes the query translator EVEN MORE complex. Please guys, don't give me examples of how easy it is to build a bridge between MongoDB and another full-text engine. It's not. Not if you are doing more serious queries than "find me people that have foo bar in their description". So please don't block this idea. I believe, and it's practically true, that there is no serious database system without simple full-text capabilities. MongoDB started as the "best of both worlds" and it should grow its functionality to provide the functionality of already existing and proven solutions. It's no accident that other databases have full-text. Really, it's not.

          darkstar Pawel Krakowiak added a comment -

          I also use ElasticSearch alongside MongoDB. I had to install & configure it just to be able to do some keyword based searches (the workaround with arrays in MongoDB was not good enough due to text size in the database). I had to write additional code in my web app just to make the full text search work. I have to constantly synchronize the documents between Mongo and ES and bend over backwards to make the queries work (there are geospatial queries involved, ES & Mongo are not 100% on the same page). All I need is to be able to do full text queries on a couple fields. This was a serious letdown for me. Now I have to maintain & support two services just to get some queries to work. I'm dying to see this feature implemented.

          chengas123 Ben McCann added a comment -

          Elastic search would be FAR easier to use with MongoDB if there were an elasticsearch river for MongoDB. There'd need to be a trigger or post-commit hook (https://jira.mongodb.org/browse/SERVER-124) for that to happen though.

          tegan Tegan Clark added a comment -

          Any update on when full-text search is going to start appearing in nightly builds? Is this still realistic for 2012 as stated in the Nov 16 2011 comment? Thanks!

          djfobbz Rick Sandhu added a comment -

          Where do we stand on this? No response from the development team is aggravating, to say the least.

          eliot Eliot Horowitz added a comment -

          There is still no firm date for when this is going to be done, but it's definitely on the short list of new features.

          artem Artem added a comment -

          Do you plan to ship this feature in 2012?

          tegan Tegan Clark added a comment -

          Any update on when work will start on this? I'm desperate for it!

          tegan Tegan Clark added a comment -

          In my mind this is heading into the very definition of vapor-ware: something that's been promised since Oct 2009 and still hasn't even hit nightly builds in Nov 2012! It all feels a little like wordage to convince you not to head towards Couch or Riak.

          Now I've got to go figure out how to bridge Mongo and ElasticSearch! It's going to get ugly!

          renctan Randolph Tan added a comment -

          Hi Tegan,

          While the feature is not yet ready, if you plan to use elastic search I encourage you to check out the mongo-connector project:

          http://blog.mongodb.org/post/29127828146/introducing-mongo-connector

          georges.polyzois@gmail.com Georges Polyzois added a comment -

          We use elasticsearch and mongo and it works great so far. Queries are really fast even with 100 million documents. The feature set and ease of use are really great.

          There is some pain, of course, in keeping data in sync between the two. One might even consider just using ES, depending on the use case.

          Good luck

          tegan Tegan Clark added a comment - - edited

          @Randolph Tan; firstly, thank you very much for the pointer to a very valid option I didn't know existed. Again, thanks - and I mean that.

          It somehow just smells wrong, though, that you guys are engineering integration points to services that this feature would presumably make redundant?

          Or am I just way off base?

          vilhelmk Vilhelm K. Vardøy added a comment -

          IMHO an integration with a proper search engine is a far better approach than implementing a feature like this. If full-text gets implemented in MongoDB, I very much doubt it will ever be as flexible and feature-rich as something that's specialized for the job.

          Even though, for example, MySQL and other databases have full-text, people have chosen other tools for search for ages, just because of this. I would probably choose something like ElasticSearch again even if mongodb were to support basic full-text search, because of the extended functionality ES gives me. If my full-text-search needs weren't big - maybe a sentence here and there - I would probably just do the multikeys thing, though.

          Roughly: It's about choosing the right tool for the job, where each tool is specialized for each task.

          auto auto (Inactive) added a comment -

          Author: Eliot Horowitz <eliot@10gen.com>, 2012-12-25T17:07:05Z

          Message: SERVER-380: Add snowball stemmer
          Branch: master
          https://github.com/mongodb/mongo/commit/d2df300721805ace411b5d1a87cb4bf6d8a51ff3

          auto auto (Inactive) added a comment -

          Author: Eliot Horowitz <eliot@10gen.com>, 2012-12-25T17:08:28Z

          Message: SERVER-380: Experimental text search indexing
          Branch: master
          https://github.com/mongodb/mongo/commit/f201972ecc87f099777e1c61f269998f4399caf4

          Show
          auto auto (Inactive) added a comment - Author: {u'date': u'2012-12-25T17:08:28Z', u'email': u'eliot@10gen.com', u'name': u'Eliot Horowitz'} Message: SERVER-380 : Experimental text search indexing Branch: master https://github.com/mongodb/mongo/commit/f201972ecc87f099777e1c61f269998f4399caf4
          auto auto (Inactive) added a comment -

          Author: Eliot Horowitz <eliot@10gen.com>
          Date: 2012-12-25T17:40:37Z
          Message: SERVER-380: When testing via mongos, have to enable on all shards
          Branch: master
          https://github.com/mongodb/mongo/commit/d6cf0d675c7067e1a26ae5ee4ddd6dc16ea2feb2
          auto auto (Inactive) added a comment -

          Author: Eliot Horowitz <eliot@10gen.com>
          Date: 2012-12-25T17:42:47Z
          Message: SERVER-380: use unique colletions for each test
          Branch: master
          https://github.com/mongodb/mongo/commit/2ae88e148381a437f6a8bdbaab09e5200cad4b68
          auto auto (Inactive) added a comment -

          Author: Tad Marshall <tad@10gen.com>
          Date: 2012-12-26T01:00:42Z
          Message: SERVER-380 Fix access violation on Windows

          Make copy of StringData passed to Tokenizer in case the original
          was in a temporary.
          Branch: master
          https://github.com/mongodb/mongo/commit/80507d135a22723e95f5b1be9938c3bf5309d273
          us Ulf Schneider added a comment -

          First of all: many thanks for implementing this feature. Now that I can see what direction you are working in, I would like to ask the following question:
          I'm storing file attachments of various formats (PDF, MS Office, images and so on) via GridFS, and I can parse those attachments with Apache Tika for indexable text content. I get this indexable text content as a set of strings. As you can imagine, the resulting sets could be large (up to megabytes). Is it a suitable use case to store those parsing results inside a field that will be text-indexed the way you have implemented it now?
          mitchitized Mitch Pirtle added a comment - edited

          Wanting to log a use case that I suspect will be a bigger issue moving forward - we created a Lithium plugin for is_translatable behavior, where a single document can have embedded properties for each language supported by a given app. This is also starting to crop up in the Rails world, and I suspect it will grow in other PHP frameworks as well.

          One of the major draws to MongoDB is the document model, and forcing the language at the top level is counter to that approach (at least it is for some of us).

          Would it be too difficult to be able to index nested properties by language, instead of forcing the language for the entire document?

          For example:

          {
            name : "Mitch Pirtle",
            location : [
              {
                "language" : "English",
                "country" : "Italy",
                "city" : "Turin",
                "profile" : "I write code and play bass guitar."
              },
              {
                "language" : "Italian",
                "country" : "Italia",
                "city" : "Torino",
                "profile" : "Scrivo il codice e il basso suonare la chitarra."
              }
            ]
          }

          Can I index location.profile based on the specified language? Just an example, but hopefully it gets my request across clearly.
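          (For reference, the text-index feature does honor a "language" field inside embedded documents, overriding the document-level default for that subdocument's terms. A rough shell sketch of the use case above — collection name is illustrative, and language values use MongoDB's lowercase language names rather than the capitalized ones in the example:

          > db.people.ensureIndex( { "location.profile": "text" } )
          > db.people.insert( {
              name: "Mitch Pirtle",
              location: [
                { language: "english", country: "Italy",  city: "Turin",  profile: "I write code and play bass guitar." },
                { language: "italian", country: "Italia", city: "Torino", profile: "Scrivo il codice e suono il basso." }
              ]
            } )

          Each array element's profile would then be stemmed with its own language's rules.)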
          eliot Eliot Horowitz added a comment -

          Mitch, definitely not for 2.4, but certainly likely in the future.
          Can you open a new ticket for that?
          mitchitized Mitch Pirtle added a comment -

          Will do, thanks for the encouraging words.
          marians Marian Steinbach added a comment - edited

          Great to see full text search come to life!

          I have some questions about the German stop word list (https://github.com/mongodb/mongo/blob/master/src/mongo/db/fts/stop_words_german.txt). What basis was it generated from? Where would I add suggestions for improvement?

          As it stands, it contains nouns that definitely shouldn't be in a default stop word list, e.g. "nutzung" (usage), "schreiben" (letter / writing), "arbeiten" (works), "mann" (man), "ehe" (marriage), "frau" (woman), "bedarf" (need), "ende" (end), "fall" (case), etc.

          Should I open a new ticket or directly send a pull request?
          turneliusz Tuner added a comment -

          No Polish support ;( Can I help somehow with that?
          dan@10gen.com Dan Pasette added a comment - edited

          Marian Steinbach, can you open a new ticket for the German stop word list? It would be helpful to get your feedback.

          Tuner, can you also open a new ticket for Polish support?
          turneliusz Tuner added a comment -

          Dan Pasette, done
          marians Marian Steinbach added a comment -

          Issue on German stop word list created as SERVER-8334
          steve.schlotter Steve Schlotter added a comment -

          Wow wow wow! Thank you for this feature! Is it on the road map to return a cursor instead of a document?
          eliot Eliot Horowitz added a comment -

          Steve - yes, likely to be in 2.6 (returning a cursor and/or inclusion in the normal query language).
          alainc Alain Cordier added a comment - edited

          I think the French stop word list is also not as accurate as it could be (it contains some nouns, and a lot are missing).
          You can find a good one (at least for French, but I suppose for some others too) here
          bkostadinovic Bojan Kostadinovic added a comment -

          Eliot, is there an option to do full-text search but sort by a field instead of by score?
          rassi@10gen.com Jason Rassi added a comment -

          Bojan Kostadinovic: yes, see SERVER-9392.
          bkostadinovic Bojan Kostadinovic added a comment -

          OK, in other words: "No, but hopefully coming with version 2.5", since that one is Unresolved.
          rassi@10gen.com Jason Rassi added a comment -

          Correct, that's the in-development syntax Eliot was referring to in his comment "yes, likely to be in 2.6".
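          (For reference, the cursor-based $text syntax being referred to looks roughly like this in later releases — a sketch only, where "created" is a hypothetical field on the documents:

          > db.foo.find(
              { $text: { $search: "walk" } },
              { score: { $meta: "textScore" } }
            ).sort( { created: -1 } )

          i.e. $text integrates with find(), so results come back as a cursor and can be sorted by an ordinary field rather than by score.)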

            Dates

            • Days since reply: 2 years, 13 weeks ago