[SERVER-380] basic text indexing and search Created: 21/Oct/09 Updated: 07/Apr/23 Resolved: 07/Jan/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Index Maintenance, Querying |
| Affects Version/s: | None |
| Fix Version/s: | 2.3.2 |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Raj Kadam | Assignee: | Eliot Horowitz (Inactive) |
| Resolution: | Done | Votes: | 239 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Participants: |
Adam Walczak, Alain Cordier, Alan Wright, Andrew Armstrong, Andrew G, Artem, auto, Ben McCann, Bojan Kostadinovic, Bryan C Green, dan, Daniel Pasette, David James, David Lee, David Lehmann, Eliot Horowitz, Eric Mill, Felix Gao, Georges Polyzois, Gerhard Balthasar, gf, Glenn Maynard, huangzhijian, Jeremy Hinegardner, Johan Bergström, J Rassi, Kaspar Fischer, Marian Steinbach, Matt L, Michael Stephens, Mitch Pirtle, Nicolas Fouché, Pawel Krakowiak, Raj Kadam, Randolph Tan, rgpublic, Richard Boulton, Rick Sandhu, Rob, Roger Binns, Rouben Meschian, Sebastian Friedel, Sougata Pal, Steve Schlotter, Tegan Clark, Tim Hawkins, trbs, Tuner, Ulf Schneider, Vilhelm K. Vardøy, Walt Woods
|
| Description |
|
Simple text indexing.

Currently works only via the "text" command. Features:

Simple Example:
|
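The description mentions that the feature currently works only via the "text" command. As a purely illustrative, self-contained sketch of what such a command does conceptually (tokenize documents, build an inverted index, return matches scored by matching-term count), here is a toy version. All names here are invented; this is not the server's implementation.

```javascript
// Toy in-memory model of a text index: tokenize, invert, score by term hits.
function tokenize(s) {
  return s.toLowerCase().split(/\W+/).filter(function (w) { return w.length > 0; });
}

function buildTextIndex(docs, field) {
  var index = {}; // term -> array of document positions
  docs.forEach(function (doc, i) {
    tokenize(doc[field]).forEach(function (term) {
      (index[term] = index[term] || []).push(i);
    });
  });
  return index;
}

function textSearch(docs, index, query) {
  var scores = {}; // document position -> count of matching query terms
  tokenize(query).forEach(function (term) {
    (index[term] || []).forEach(function (i) {
      scores[i] = (scores[i] || 0) + 1;
    });
  });
  return Object.keys(scores)
    .map(function (i) { return { score: scores[i], obj: docs[i] }; })
    .sort(function (a, b) { return b.score - a.score; });
}
```

The shape of each result ({ score, obj }) mirrors how a scored text-search result might look, but the scoring here is a deliberate simplification.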
| Comments |
| Comment by J Rassi [ 29/Apr/13 ] |
Correct, that's the in-development syntax Eliot was referring to in his comment "yes, likely to be in 2.6".
| Comment by Bojan Kostadinovic [ 29/Apr/13 ] |
OK, in other words: "No, but hopefully coming with version 2.5", since that one is Unresolved.
| Comment by J Rassi [ 29/Apr/13 ] |
bkostadinovic: yes, see
| Comment by Bojan Kostadinovic [ 29/Apr/13 ] |
Eliot, is there an option to do full-text search but sort by a field instead of by score?
| Comment by Alain Cordier [ 06/Feb/13 ] |
I think the French stop word list is also not as accurate as it could be (it contains some nouns, and a lot are missing).
| Comment by Eliot Horowitz (Inactive) [ 25/Jan/13 ] |
Steve - yes, likely to be in 2.6 (returning a cursor and/or inclusion in the normal query language).
| Comment by Steve Schlotter [ 25/Jan/13 ] |
Wow wow wow! Thank you for this feature! Is it on the road map to return a cursor instead of a document?
| Comment by Marian Steinbach [ 25/Jan/13 ] |
Issue on German stop word list created as
| Comment by Tuner [ 25/Jan/13 ] |
Dan Pasette, done
| Comment by Daniel Pasette (Inactive) [ 25/Jan/13 ] |
marians, can you open a new ticket for the German stop word list? It would be helpful to get your feedback. turneliusz, can you also open a new ticket for Polish support?
| Comment by Tuner [ 25/Jan/13 ] |
No Polish support ;( Can I help somehow with that?
| Comment by Marian Steinbach [ 25/Jan/13 ] |
Great to see full text search come to life! I have questions about the German stop word list (https://github.com/mongodb/mongo/blob/master/src/mongo/db/fts/stop_words_german.txt). What basis was it generated from? Where would I add suggestions for improvement? As it stands, it contains nouns that definitely shouldn't be in a default stop word list, e.g. "nutzung" (usage), "schreiben" (letter / writing), "arbeiten" (works), "mann" (man), "ehe" (marriage), "frau" (woman), "bedarf" (need), "ende" (end), "fall" (case), etc. Should I open a new ticket or directly send a pull request?
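The complaint above is concrete: a stop word filter silently makes every listed word unsearchable, so over-broad lists drop real content words. A minimal sketch of how such a filter is applied during tokenization (the list contents here are illustrative, not the actual shipped list):

```javascript
// Illustrative stop word filtering; including a content noun like "nutzung"
// in the list means documents containing it can never match that word.
var GERMAN_STOP_WORDS = new Set(["und", "der", "die", "das", "nutzung"]);

function stripStopWords(tokens, stopWords) {
  return tokens.filter(function (t) { return !stopWords.has(t); });
}
```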
| Comment by Mitch Pirtle [ 10/Jan/13 ] |
Will do, thanks for the encouraging words.
| Comment by Eliot Horowitz (Inactive) [ 10/Jan/13 ] |
Mitch, definitely not for 2.4, but certainly likely in the future.
| Comment by Mitch Pirtle [ 10/Jan/13 ] |
Wanting to log a use case that I suspect will be a bigger issue moving forward: we created a Lithium plugin for is_translatable behavior, where a single document can have embedded properties for each language supported by a given app. This is also starting to crop up in the Rails world, and I suspect it will grow in other PHP frameworks as well. One of the major draws to MongoDB is the document model, and forcing the language at the top level is counter to that approach (at least it is for some of us). Would it be too difficult to be able to index nested properties by language, instead of forcing the language for the entire document? For example:

Can I index location.profile based on the specified language? Just an example, but hopefully it gets my request across clearly.
| Comment by Ulf Schneider [ 10/Jan/13 ] |
First of all: many thanks to you for implementing this feature. Now, as I can see in what direction you are working, I would like to ask the following question:
| Comment by auto [ 26/Dec/12 ] |
Author: Tad Marshall <tad@10gen.com> (2012-12-26T01:00:42Z). Message: Make copy of StringData passed to Tokenizer in case the original
| Comment by auto [ 25/Dec/12 ] |
Author: Eliot Horowitz <eliot@10gen.com> (2012-12-25T17:42:47Z). Message:
| Comment by auto [ 25/Dec/12 ] |
Author: Eliot Horowitz <eliot@10gen.com> (2012-12-25T17:40:37Z). Message:
| Comment by auto [ 25/Dec/12 ] |
Author: Eliot Horowitz <eliot@10gen.com> (2012-12-25T17:08:28Z). Message:
| Comment by auto [ 25/Dec/12 ] |
Author: Eliot Horowitz <eliot@10gen.com> (2012-12-25T17:07:05Z). Message:
| Comment by Vilhelm K. Vardøy [ 18/Nov/12 ] |
IMHO an integration with a proper search engine is a far better approach than implementing a feature like this. If full-text gets implemented in MongoDB, I very much doubt it will ever be as flexible and feature-rich as something that's specialized for the job. Even though, for example, MySQL and other databases have full-text, people have chosen other tools for search for ages, just because of this. I would probably choose something like ElasticSearch again even if MongoDB were to support basic full-text search, because of the extended functionality ES gives me. If my full-text-search needs weren't big - maybe a sentence here and there - I would probably just do the multikeys thing, though. Roughly: it's about choosing the right tool for the job, where each tool is specialized for each task.
| Comment by Tegan Clark [ 18/Nov/12 ] |
@Randolph Tan: firstly, thank you very much for the pointer to a very valid option I didn't know existed. Again, thanks, and I mean that. It somehow just smells wrong, though, that you guys are engineering integration points to services that this feature would presumably make redundant? Or am I just way off base?
| Comment by Georges Polyzois [ 17/Nov/12 ] |
We use Elasticsearch and Mongo and it works great so far. Queries are really fast even with 100 million documents. The feature set and ease of use are really great. There is some pain, of course, in keeping data in sync between the two. One might even consider just using ES, depending on the use case. Good luck!
| Comment by Randolph Tan [ 17/Nov/12 ] |
Hi Tegan, while the feature is not yet ready, if you plan to use Elasticsearch, I encourage you to check out the mongo-connector project: http://blog.mongodb.org/post/29127828146/introducing-mongo-connector
| Comment by Tegan Clark [ 17/Nov/12 ] |
In my mind this is heading into the very definition of vapor-ware: something that's been promised since Oct 2009 and still hasn't even hit nightly builds in Nov 2012! It all feels a little like wordage to convince you not to head towards Couch or Riak. Now I've got to go figure out how to try and bridge Mongo and Elasticsearch! It's going to get ugly!
| Comment by Tegan Clark [ 09/Nov/12 ] |
Any update on when work will start on this? I'm desperate for it!
| Comment by Artem [ 16/Oct/12 ] |
Do you plan to ship this feature in 2012?
| Comment by Eliot Horowitz (Inactive) [ 10/Jul/12 ] |
There is still no firm date for when this is going to be done, but it's definitely on the short list of new features.
| Comment by Rick Sandhu [ 10/Jul/12 ] |
Where do we stand on this? No response from the development team is aggravating, to say the least.
| Comment by Tegan Clark [ 04/Jul/12 ] |
Any update on when full-text search is going to start appearing in nightly builds? Is this still realistic for 2012, as stated in the Nov 16 2011 comment? Thanks!
| Comment by Ben McCann [ 16/Apr/12 ] |
Elasticsearch would be FAR easier to use with MongoDB if there were an Elasticsearch river for MongoDB. There'd need to be a trigger or post-commit hook (https://jira.mongodb.org/browse/SERVER-124) for that to happen, though.
| Comment by Pawel Krakowiak [ 05/Apr/12 ] |
I also use ElasticSearch alongside MongoDB. I had to install & configure it just to be able to do some keyword-based searches (the workaround with arrays in MongoDB was not good enough due to text size in the database). I had to write additional code in my web app just to make full text search work. I have to constantly synchronize the documents between Mongo and ES and bend over backwards to make the queries work (there are geospatial queries involved, and ES & Mongo are not 100% on the same page). All I need is to be able to do full text queries on a couple of fields. This was a serious letdown for me. Now I have to maintain & support two services just to get some queries to work. I'm dying to see this feature implemented.
| Comment by Tuner [ 05/Apr/12 ] |
@rgpublic, I'm actually using this solution and it's a PAIN to have a translator of queries to two databases, MongoDB and ElasticSearch. The synchronization is done manually, and ElasticSearch needs MORE fields, for example to sort on RAW data - because ES is by default just a bunch of indexes which can't even be sorted by original value. So it makes the query translator EVEN MORE complex. Please guys, don't give me examples of how easy it is to make a bridge between MongoDB and another full-text engine. It's not. Not if you are doing more serious queries than "find me people that have foo bar in their description". So please don't block this idea. I believe - and it's practically true - that there is no serious database system without simple full-text capabilities. MongoDB started as the "best of both worlds" and it should grow its functionality to provide the functionality of already existing and proven solutions. It's not an accident that other databases have full-text. Really, it's not.
| Comment by rgpublic [ 04/Apr/12 ] |
I don't agree either. Not only map/reduce, but basically any search using a combination of an ordinary query and fulltext search would be impossible. In fact, you can already install, say, ElasticSearch and keep that in sync with MongoDB manually. The problem comes up if you want to run a combined query. You need to learn and use two completely different query languages, download all results (possibly a lot) to the client, and recombine them there. BTW: if SERVER-153 ever gets implemented before this, we could create an index on 'field.split(" ")' and query that to get a simple fulltext search.
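The split-the-field-into-words workaround mentioned above can be sketched without any server: store a word array alongside the document and match single words against it, which is what a multikey index over that array would let you do efficiently. Names like _words are illustrative conventions, not anything MongoDB defines.

```javascript
// Sketch of the word-array workaround: precompute the split, match on it.
function withWordArray(doc, field) {
  var copy = Object.assign({}, doc);
  copy._words = doc[field].toLowerCase().split(/\s+/); // what an index on field.split(" ") would cover
  return copy;
}

function findByWord(docs, word) {
  return docs.filter(function (d) { return d._words.indexOf(word.toLowerCase()) !== -1; });
}
```

The limitation the thread keeps circling back to is visible here too: single-word matching falls out naturally, but phrase matching and ranking do not.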
| Comment by Tuner [ 04/Apr/12 ] |
@Glenn Maynard, I don't agree. What about queries that use map/reduce and fulltext at the same time?
| Comment by Glenn Maynard [ 04/Apr/12 ] |
-1 to integrating an actual FTS engine into Mongo. Mongo should provide generic building blocks for implementing features. Complex, domain-specific features like FTS should be layered on top of it. Features like plugins that can receive notification of changes so they can update indexes sound more reasonable.
| Comment by Sougata Pal [ 12/Jan/12 ] |
MongoLantern can also be a good option for fulltext searching directly with MongoDB, without installing any new software. Though it has a few optimization issues for very large databases. Hopefully they will be fixed in later versions.
| Comment by Rouben Meschian [ 12/Dec/11 ] |
Having the ability to perform full text search across MongoDB collections would be amazing.
| Comment by Eliot Horowitz (Inactive) [ 16/Nov/11 ] |
There is a likely design. There are just more pressing things to work on at the moment. We're hopeful this makes it in 2012.
| Comment by David Lee [ 16/Nov/11 ] |
Is there a design waiting to be implemented? Or is there no accepted design proposal yet? It looks like this feature request is in Planning Bucket A. What does the timeline look like for Planning Bucket A?
| Comment by Eliot Horowitz (Inactive) [ 16/Nov/11 ] |
Not dead - just hasn't been implemented yet.
| Comment by Rick Sandhu [ 12/Nov/11 ] |
Is this ticket dead? It's been over 2 years with no resolution... admins, please update! Thanks.
| Comment by Felix Gao [ 13/Aug/11 ] |
It has been over 2 years now; what is the current status of this ticket?
| Comment by Kaspar Fischer [ 07/Jul/11 ] |
From the issue comments given so far, I read that lots of people want a simple, out-of-the-box full-text search solution. On the other hand, others want more advanced, external search solutions to be integrated. These seem to be two different concerns. If I understand correctly, both concerns need support from MongoDB (which is not yet available, or underway): namely, a (simple or external) search solution needs to be able to learn when to index what, and when to update the index. Maybe such a layer could be added, and afterwards people can come up with different "plugins" that realize out-of-the-box search or integration with external search solutions? I would be happy to work on the latter, but to do so I would very much welcome an API where my search solutions can plug in (otherwise I will have to become a MongoDB internals expert).
| Comment by rgpublic [ 07/Jul/11 ] |
OK. Don't want to add too much spam, but since everyone seems to be writing about their ideas on this top-voted issue, here goes: for our company, what's really great about MongoDB is its innovation along the lines of: create a database that finally does what IMHO it should have done all the years before, i.e. don't add lots of additional work for a developer (fighting with SQL statements, data types, etc.), but instead make it as easy as possible to 1) get data in, 2) get data out (by searching for it). Consequently, for a full text solution to be "Mongo-like", I guess it should most importantly be seamless! The user shouldn't notice there is an external engine, neither during installation nor during daily work. There shouldn't be any difference in use between a normal index and a fulltext index. You should be able to create and query them just like any other index. I don't think solutions proposed here that aim to couple projects like ElasticSearch (especially not projects in a different programming language, like Java) would ever be able to meet that criterion properly. Even worse if they are kept in sync via triggers or the like. I might be wrong, but I anticipate lots of problems if MongoDB fulltext search worked like this - like the index being out of sync, etc. Rather, I would prefer it if the fulltext index simply worked as described here (with more features added later, step by step): http://api.mongodb.org/wiki/current/Full%20Text%20Search%20in%20Mongo.html Only the specific details (like having to create an additional field with the words array) should be "hidden" from the user. If one could create functional indexes (a different issue), then at least one would be able to create such an index on an array of words more easily, without those ugly auxiliary fields.
| Comment by trbs [ 06/Jul/11 ] |
I personally don't see how any non-C/C++ system, especially not Java/JVM, could be used as an embedded or internal search system for MongoDB. It seems to me that it is not worth the time talking about in this context. For these kinds of integration, developing a good external framework would be
Example: I'm using Xapian (a project which I do consider possibly embeddable by MongoDB) in several of my MongoDB projects, and that works perfectly. For an integrated search, I agree that something good, simple, and fitting 90% of the common use cases would be much better than specialized search. There are search features particularly interesting for MongoDB, like search queries across multiple collections. How to handle embedded searching in a sharded
| Comment by Adam Walczak [ 05/Jul/11 ] |
Synonym and most-used-word suggestions would also be nice as part of this feature.
| Comment by Kaspar Fischer [ 26/Jun/11 ] |
I am also interested in this. If I understand correctly, a (non-application-level) integration of ElasticSearch (or Solr, or similar technology) would be much easier if triggers were available (https://jira.mongodb.org/browse/SERVER-124).
| Comment by Bryan C Green [ 30/May/11 ] |
I am also interested in the solution involving ElasticSearch. I've converted from MySQL to PostgreSQL to ElasticSearch for fulltext search. I think I'm going to use ES + MongoDB + PostgreSQL (for now). I'd like some more Elasticsearch rivers...
| Comment by Gerhard Balthasar [ 12/Apr/11 ] |
Just stumbled over ElasticSearch - seems like the perfect add-in for MongoDB. Or what about merging? "ElasticMongo" sounds good to my ears...
| Comment by Tim Hawkins [ 04/Feb/11 ] |
Check out ElasticSearch: Lucene-based, but fully UTF-8, REST and JSON based. In most cases you can just pull a record from Mongo and push it straight to ES, just removing the "_id" field. ES then generates its own id field (actually called "_id" too), which can be the contents of your original Mongo record id as a string (you just need to put the MongoId as the identity on the PUT call that adds the record to the database). It supports embedded documents and uses the same dot notation for specifying embedded doc members. It supports incremental updates, sharded indices, and faceted search. Working with ES "feels" just like working with Mongo: similar interfaces, similar simplicity. I'm working on an oplog observer that will allow ES to track changes to MongoDB collections.
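The hand-off Tim describes above reduces to a small transformation: drop the Mongo "_id" and reuse its string form as the identity for the PUT to Elasticsearch. A sketch of just that step (the HTTP call itself is omitted; the function name is invented):

```javascript
// Prepare a MongoDB document for indexing in Elasticsearch: strip "_id",
// keep its string form as the ES document identity for the PUT.
function prepareForEs(mongoDoc) {
  var body = Object.assign({}, mongoDoc); // don't mutate the original
  var id = String(body._id);
  delete body._id;
  return { id: id, body: body };
}
```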
| Comment by Eliot Horowitz (Inactive) [ 04/Feb/11 ] |
We have some more ideas here.
| Comment by Roger Binns [ 15/Dec/10 ] |
After doing some work using MongoDB-stored content and Solr (a server wrapper over Lucene) as the FTS engine, these are the parts of my experience that mattered the most:

You need a list of tokenizers that run over the source data and queries, e.g. you need to look at quoting, embedded punctuation, capitalization etc. to handle stuff like: can't, I.B.M, big:deal, will.i.am, i.cant.believe.this, up/down, PowerShot, etc.

You need a list of filters that work on the tokens - some replace tokens (e.g. replacing upper-case tokens with entirely lower-case ones) while others add tokens (e.g. adding double metaphone representations, or stems). Another example is a filter that replaces "4" with "(4 OR four)". If you can apply a type to the tokens then that is great, so smith becomes "(smith OR DM:SMT)" and running becomes "(running OR STEM:run)". This lets you match double metaphone against double metaphone and stems against stems without "polluting" the original words.

You need some way of boosting certain results. For example, with a music corpus, searches for "michael" should rank "Michael Jackson" higher than "Michael Johnson". Note that this is not the same as sorting the results, since other tokens in the query also affect scoring - e.g. "Michael Rooty" needs to pick Michael Johnson's "Rooty Toot Toot for the Moon" and not any MJ song, despite all MJ songs having a higher boost.

Multi-valued fields need to be handled correctly. Solr/Lucene treat them as concatenated together. For example, a field name: [ "Elvis Presley", "The King" ] is treated as though it was "Elvis Presley The King". If your query is for "King", they treat that as matching one of four tokens, whereas it matches one of the three tokens of "King of Siam". The BM25-style weighting scheme (or something substantially similar) everyone uses takes the document length into account, which causes problems with the concatenation. Of course, there will be some documents where the multi-values should be concatenated, and others where they are alternatives to each other, as in my example.

You need pagination of results, which usually means an implementation that caches previous queries to quickly return later parts.

Query debugging is important, because you'll end up with unanticipated matches/scores and want to know why. In Solr you add a query parameter (debugQuery=true) and it returns that information. You can see what the tokenization and filtering did, and you can then also see, for each result, how the final score was computed (a tree structure).
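The tokenizer/filter chain Roger describes can be sketched as a pipeline of functions, where a filter either replaces tokens (lowercasing) or adds typed variants (a "STEM:" token). The stemmer here is a deliberately fake stand-in (it just strips a trailing "ing"), not a real stemming algorithm:

```javascript
// Replacement-style filter: every token is lower-cased.
function lowercaseFilter(tokens) {
  return tokens.map(function (t) { return t.toLowerCase(); });
}

// Additive filter: keeps the original token and adds a typed "stem" variant,
// so stems match stems without polluting the original words.
function addStemsFilter(tokens) {
  var out = [];
  tokens.forEach(function (t) {
    out.push(t);
    if (t.length > 5 && t.slice(-3) === "ing") out.push("STEM:" + t.slice(0, -3));
  });
  return out;
}

// Run the same pipeline over both indexed text and queries.
function runPipeline(tokens, filters) {
  return filters.reduce(function (ts, f) { return f(ts); }, tokens);
}
```

Because the same pipeline runs over documents and queries, both sides produce compatible tokens, which is the property Roger's "(running OR STEM:run)" example relies on.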
| Comment by Andrew Armstrong [ 15/Dec/10 ] |
Unfortunately it appears as though Wikipedia's data dumps are offline at the moment due to server trouble (http://download.wikimedia.org/), but it's about 30GB uncompressed to grab all of the English wiki pages (no history, etc.), which would probably be a neat data set to test against. See http://en.wikipedia.org/wiki/Wikipedia:Database_download for more info; perhaps an older copy is available as a torrent/mirrored somewhere you could use.
| Comment by Rick Sandhu [ 06/Dec/10 ] |
@Eliot, are you still looking for sample datasets with test cases? How large a dataset would you like?
| Comment by Rick Sandhu [ 05/Dec/10 ] |
+1 for embedded full text search functionality
| Comment by Eliot Horowitz (Inactive) [ 12/Nov/10 ] |
A good Sphinx adapter would be good as well, but we want something embedded with no external dependencies, so that for most cases you don't need any other systems.
| Comment by Matt L [ 12/Nov/10 ] |
...and Sphinx 1.10 (with RT) is only available as a beta (so far), and has a lot of bugs (see their bug tracker/forum). I tried to use it on my project, but gave up.
| Comment by David Lee [ 12/Nov/10 ] |
gf: The Sphinx real-time indexing is only available via SphinxQL, which is not a good option for MongoDB.
| Comment by gf [ 12/Nov/10 ] |
Mitch Pirtle: Check the Sphinx site. "Jul 19, 2010. Sphinx 1.10-beta is out: We're happy to release Sphinx 1.10-beta, with a new major version number that means, as promised, real-time indexes support."
| Comment by Mitch Pirtle [ 12/Nov/10 ] |
Sphinx doesn't have partial updates to indexes, IIRC. You'd have to regenerate all indexes from scratch for every update. I've been looking into ElasticSearch as a workaround for now, but other priorities keep me from getting it done.
| Comment by gf [ 12/Nov/10 ] |
Eliot Horowitz: why don't you want to use Sphinx? Implementation is easy, for sure. It's feature-rich and robust.
| Comment by Eliot Horowitz (Inactive) [ 12/Nov/10 ] |
Stemming for sure - phrase matching may or may not be in version 1.
| Comment by David Lee [ 12/Nov/10 ] |
That's great! Do you plan on supporting stemming and phrase matching?
| Comment by Eliot Horowitz (Inactive) [ 12/Nov/10 ] |
We have a proof of concept working in the lab, but we want to make sure it's rock solid before releasing.
| Comment by Roger Binns [ 08/Oct/10 ] |
@Walt: There are already scoring algorithms developed and tuned over the years. For example, see BM25: http://en.wikipedia.org/wiki/Okapi_BM25 BM25F would take into account multiple fields for a document (e.g. a title in addition to a body). There is an open source pure-Python text search library, Whoosh, that implements this scoring algorithm, hence providing some nice reference code. I believe it is also part of Xapian, etc.
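For concreteness, here is a sketch of Okapi BM25 per-term scoring as referenced above. k1 and b are the standard free parameters (the defaults shown are commonly used values); the IDF form here is one common non-negative variant (with a +1 inside the log), so treat the exact constants as illustrative:

```javascript
// BM25 contribution of one query term to one document's score.
// tf: term frequency in the document; docLen: document length in tokens;
// avgDocLen: average document length; df: number of documents containing
// the term; numDocs: corpus size.
function bm25Term(tf, docLen, avgDocLen, df, numDocs, k1, b) {
  k1 = k1 === undefined ? 1.2 : k1;
  b = b === undefined ? 0.75 : b;
  var idf = Math.log((numDocs - df + 0.5) / (df + 0.5) + 1);
  var norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen));
  return idf * norm;
}
```

Two properties worth noticing, both relevant to the multi-valued-field concern above: rarer terms score higher (via idf), and the same term frequency in a longer document scores lower (via the length normalization in norm), which is why concatenating multi-valued fields distorts scores.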
| Comment by Walt Woods [ 08/Oct/10 ] |
Yeah. Per-word weighting: count occurrences of a word, multiply the occurrences by the weight to get that word's total weight in the document, and store that in the index. Per-document weighting: count occurrences of a single word, divide by the total occurrence count for the indexed field (e.g. # of words), multiply this fraction by the specified index weight, and store that in the index. Essentially, using weights to bound the effects of any single field in the full text index with respect to the total document score. Maybe this is obvious and how it was going to be done anyway? Admittedly, I don't have much experience with other full-text indexers... just what I've done experimenting with my own variation for a while now.
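Walt's per-document weighting above can be written down directly: a term's contribution from a field is its occurrence fraction in that field times the field's index weight, so a long body cannot drown out a short title. The field names and weight values below are illustrative, echoing the { title: 1, body: 0.2 } example discussed later in the thread:

```javascript
// Occurrence fraction of `term` in a field, scaled by that field's weight.
function fieldWeight(term, fieldText, indexWeight) {
  var words = fieldText.toLowerCase().split(/\s+/);
  var occurrences = words.filter(function (w) { return w === term; }).length;
  return words.length ? (occurrences / words.length) * indexWeight : 0;
}

// Sum of the weighted field contributions for one term.
function documentScore(term, doc, weights) {
  var total = 0;
  for (var field in weights) total += fieldWeight(term, doc[field] || "", weights[field]);
  return total;
}
```

Note how the division by the field's word count bounds each field's influence: a single hit in a two-word title (0.5 x weight) can outweigh several hits in a long body.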
| Comment by Eliot Horowitz (Inactive) [ 08/Oct/10 ] |
Can you describe exactly what you mean by per-word and per-document?
| Comment by Walt Woods [ 08/Oct/10 ] |
@Eliot Horowitz - Ah... Do you think it would be a possibility to provide per-document weights? Titles are almost always short, but bodies are much more variable; in my opinion, it's not very fair to count a longer document as more pertinent to the subject requested.
| Comment by Eliot Horowitz (Inactive) [ 07/Oct/10 ] |
@walt - the idea is per word, if I understand what you mean.
| Comment by Walt Woods [ 07/Oct/10 ] |
@Eliot Horowitz - A question about your above-mentioned ensureIndex( { title: 1, body: 0.2 }, { fullTextSearch: true } ) API example: are these per-word weights or per-document weights?
| Comment by Eliot Horowitz (Inactive) [ 15/Sep/10 ] |
Ideally a data set and some test cases. Don't need any code.
| Comment by Johan Bergström [ 15/Sep/10 ] |
Eliot: I have a system using Postgres to Mongo (transition period) and Xapian for full text search. What kind of input do you seek? (Sorry, this isn't OSS.)
| Comment by Eliot Horowitz (Inactive) [ 15/Sep/10 ] |
Does anyone watching this case have a dataset and tests with another system (MySQL, Xapian, etc.)?
| Comment by huangzhijian [ 09/Sep/10 ] |
@Eliot Horowitz I am really excitedly awaiting the full-text search functionality; it would be fantastic if it could support the Chinese language.
| Comment by Eric Mill [ 06/Sep/10 ] |
@dan - I'd consider using your library if there were documentation - I have little experience with MongoDB plugins and I'm not sure how to use your code. Phrasal search is important, but I would be fine using a solution that lacked it as a stopgap. @eliot - I'll be crossing my fingers, then.
| Comment by Eliot Horowitz (Inactive) [ 26/Aug/10 ] |
@dan - whatever we do will definitely be embedded in the DB, written in C++.
| Comment by dan [ 26/Aug/10 ] |
@eric - So you didn't like our native Mongo search? Is phrasal searching the only show stopper?
| Comment by Eliot Horowitz (Inactive) [ 25/Aug/10 ] |
It's being considered for 1.7.x.
| Comment by Eric Mill [ 25/Aug/10 ] |
Is this still targeted for release in 1.7? I'm gauging whether I should hold out, or go try to integrate with Solr.
| Comment by Michael Stephens [ 09/Aug/10 ] |
I've started work on a little tool (http://github.com/mikejs/photovoltaic) that uses Mongo's replication internals to automatically keep a Solr index up to date. It's rough around the edges, but my FTS needs aren't very fancy.
| Comment by dan [ 07/Aug/10 ] |
You can add phrasal searching to the library we have produced using simple JavaScript; patches welcome. We've been keen to keep the library fast and simple so far, and haven't added phrasal search as such, because a well-ranked stemming search has done the job for us very well. As for real-time-ness... I guess that would require C++-level support, unless you were keen to implement the indexing function in a client library; mapreduce is currently the only option for non-blocking server-side JS execution. What do you mean by "keyword search" specifically?
| Comment by Rob [ 06/Aug/10 ] |
I'd also like to vote for these:

I'm all for keeping MongoDB simple, as others have stated. I agree that stemming/wildcards could be deferred to advanced cases. But if you leave out phrase searching, there's no advantage over just breaking words into an array, is there?
| Comment by Andrew G [ 05/Aug/10 ] |
In case anyone is interested, I have written a prototype desktop application for which I need a database with a text index/search facility (with incremental updates). It runs on Kubuntu if the deb package python-storm is installed. It should run under Windows if you install enough things (Python 2.6, Qt, PyQt, Canonical Storm). http://kde-apps.org/content/show.php/Knowledge?content=111504 Article here: http://dot.kde.org/2010/06/29/knowledge-different-approach-database-desktop
| Comment by dan [ 26/Jul/10 ] |
My company has written a Mongo-native full-text search. It currently supports English - although stemmer commits are welcome. There is also a Python library, which has substantial extra functionality because of restrictions on server-side JavaScript execution. Indexing happens via mapreduce for maximum concurrency. The v8 build is recommended for speed: our trials report about a 4x speed increase.
| Comment by gf [ 26/Jul/10 ] |
http://sphinxsearch.com/
| Comment by Roger Binns [ 17/Jul/10 ] |
Another requirement for me, not mentioned here, is providing information for autocompleting fields in a web interface.
| Comment by David James [ 05/Apr/10 ] |
@Nicolas Fouché: just saw your comment about using Treetop to create a query language, so I wanted to share a little code I put together: http://github.com/djsun/query_string_filter which converts a 'filter' param in a query string to a hash suitable for MongoMapper.
| Comment by Eliot Horowitz (Inactive) [ 02/Apr/10 ] | ||||||||||||||||
|
trying to keep 1.5/1.6 very focused on sharding + replica sets. | ||||||||||||||||
| Comment by Raj Kadam [ 04/Dec/09 ] | ||||||||||||||||
|
If Sphinx does incremental updates, then yes, I believe it is at the top of the pack. | ||||||||||||||||
| Comment by gf [ 03/Dec/09 ] | ||||||||||||||||
|
Sphinx rules. The "incremental updates" feature is coming soon. | ||||||||||||||||
| Comment by Eliot Horowitz (Inactive) [ 24/Nov/09 ] | ||||||||||||||||
|
Most of what you need should be here: http://www.mongodb.org/display/DOCS/Replication+Internals | ||||||||||||||||
| Comment by Eliot Horowitz (Inactive) [ 19/Nov/09 ] | ||||||||||||||||
|
No licensing issues I believe - certainly not in building yourself. The external hook is very easy to get started with. Very simple to tap into mongo's replication log and use that for updating indexes. Embedding is more complicated. We just built a module system that allows external modules to be linked in. The 3rd level would be having it totally built in. That's a ways off, and i'm not sure if it's something that is needed anyway | ||||||||||||||||
| Comment by Nicolas Fouché [ 08/Nov/09 ] | ||||||||||||||||
|
I've developed my own search feature on MongoDB, with $ft keys. If you store documents in all languages, stemming is too much pain (language detection, then applying the right stemmer if one can be applied - and sometimes language detection per paragraph/sentence is needed). Like Eliot says, if MongoDB embeds a full-text search feature, it should be as minimal as possible: extract words, convert them to ASCII, remove 1 or 2 character words, and put them in an Array key. The more MongoDB does, the more choices they make, and the fewer use cases they'll match. Stemming can be (kind of) replaced by expanding the query http://en.wikipedia.org/wiki/Query_expansion . For that I still wait for the $or feature. On top of that you build a query grammar (with Treetop for example http://treetop.rubyforge.org/), create multiple keys for full text, and you have a search engine that supports metadata queries, e.g.: "waterdrop subject:house date:>20090101". And as soon as the $or feature is ready, the user could add OR keywords in their queries (to satisfy one or two of them). If anyone is interested, I can write blog articles to describe this solution in depth. Of course you don't have wildcards, phrase search, fuzzy search, nearby search or scoring. But I suppose that if you need those, then you definitely don't target average users. Take a look at the search feature in Facebook messages powered by Cassandra; it's horrible (non-English people will understand), it does not even find the same exact words you type in your query... but it's blazing fast and no-one complains. It seems that Twitter added phrase search recently; Digg did not, and neither did Yammer. As a former Lucene user, I thought that I needed all these features, but I discovered that none of our users asked for them, and I do not actually need them to find what I'm looking for. 
In startups we don't want to spend 80% of the time satisfying 1% of our users. For developers needing a featureful search, why not consider an "external hook" à la CouchDB? | ||||||||||||||||
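The minimal pipeline described above (extract words, convert to ASCII, drop 1-2 character words, store in an array key) could be sketched like this; the function name, field names and query shape are hypothetical:

```javascript
// Minimal keyword pipeline: lowercase, fold accents to plain ASCII,
// tokenize, and drop words of 1-2 characters.
function toKeywords(text) {
  return (text
    .toLowerCase()
    .normalize('NFD')                 // decompose accented characters
    .replace(/[\u0300-\u036f]/g, '')  // strip the combining marks
    .match(/[a-z0-9]+/g) || [])
    .filter(w => w.length > 2);
}

const doc = { body: "Goutte d'eau sur la fenêtre" };
doc._keywords = toKeywords(doc.body);
// store doc, index { _keywords: 1 }, then query { _keywords: 'fenetre' }
```

A multikey index on the `_keywords` array then gives exact-word search across languages without any per-language stemmer.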
| Comment by David Lehmann [ 30/Oct/09 ] | ||||||||||||||||
|
I forgot to say that I'm not talking about what 10gen should do as a first step | ||||||||||||||||
| Comment by David Lehmann [ 30/Oct/09 ] | ||||||||||||||||
|
@eliot: That's why I asked what we can put into the indexes and what could be done in the client implementations/manipulators. I'm sure that it is possible to have real full text search in Mongo with a query language that is a superset (doh, a connected word again). A first step would be to do exactly what you propose when you talk about Lucene with a Mongo backend - could be any engine with a clean separation of NLA and persistency. The simple search you described earlier would be the "ASCII of search". This would not deliver the wanted results in languages like French, German ... you name it, we have it | ||||||||||||||||
| Comment by Eliot Horowitz (Inactive) [ 30/Oct/09 ] | ||||||||||||||||
|
Maybe. It could even be clucene with a mongo backend. | ||||||||||||||||
| Comment by Sebastian Friedel [ 30/Oct/09 ] | ||||||||||||||||
|
@Alan: having integrated search engines several times over the last years, I know that every serious search implements at least prefix or infix search along with stemming | ||||||||||||||||
| Comment by Eliot Horowitz (Inactive) [ 30/Oct/09 ] | ||||||||||||||||
|
A lot of it comes down to covering all the basics while keeping it simple and fast. Also, while we work on basic full text search inside Mongo, it still might make sense to have adapters to an outside, non-realtime search engine that could be more complex. | ||||||||||||||||
| Comment by Alan Wright [ 30/Oct/09 ] | ||||||||||||||||
|
@Sebastian - I would take a look at the Lucene Wildcards (http://lucene.apache.org/java/2_3_2/queryparsersyntax.html#Wildcard%20Searches). This would provide the capability you need (e.g. searching for "water*" would find "waterdrop" and "waterbed") | ||||||||||||||||
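A `water*`-style prefix search over per-document keyword arrays could be approximated without any wildcard engine; the client-side stand-in below is a sketch (the `_keywords` field and function name are made up), and in the shell the equivalent would be a left-anchored regex like `db.foo.find({ _keywords: /^water/ })`.

```javascript
// Client-side sketch of a "water*" prefix search: keep documents where
// any indexed keyword starts with the given prefix.
function prefixSearch(docs, prefix) {
  const re = new RegExp('^' + prefix);
  return docs.filter(d => d._keywords.some(w => re.test(w)));
}

const docs = [
  { _id: 1, _keywords: ['waterbed', 'cheap'] },
  { _id: 2, _keywords: ['fire', 'drop'] },
];
const found = prefixSearch(docs, 'water');
// matches _id 1 via 'waterbed'
```

A left-anchored regex is notable because MongoDB can satisfy it from a regular index on the keyword array, so prefix search is cheap even without dedicated wildcard support; infix search (`*water*`) would still require a scan.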
| Comment by Sebastian Friedel [ 30/Oct/09 ] | ||||||||||||||||
|
I don't think that it is so seldom needed. Think of things like 'waterbed' or 'waterdrop' ... as a user I would find it very irritating if I didn't find any of those when I search for 'water'. | ||||||||||||||||
| Comment by Eliot Horowitz (Inactive) [ 30/Oct/09 ] | ||||||||||||||||
|
i don't think version 1 would have substring matching. i think that would be a different mode, since it's a lot more costly, and not as often needed | ||||||||||||||||
| Comment by Sebastian Friedel [ 30/Oct/09 ] | ||||||||||||||||
|
hello everyone | ||||||||||||||||
| Comment by Eliot Horowitz (Inactive) [ 30/Oct/09 ] | ||||||||||||||||
|
the index would just be the words. | ||||||||||||||||
| Comment by David Lehmann [ 30/Oct/09 ] | ||||||||||||||||
|
@eliot good to know | ||||||||||||||||
| Comment by Eliot Horowitz (Inactive) [ 30/Oct/09 ] | ||||||||||||||||
|
@David - right, that's what i meant as well | ||||||||||||||||
| Comment by David Lehmann [ 30/Oct/09 ] | ||||||||||||||||
|
With distance I mean the distance used in proximity searches not weight. | ||||||||||||||||
| Comment by Eliot Horowitz (Inactive) [ 29/Oct/09 ] | ||||||||||||||||
|
I think distance, etc. can be done on the query side rather than the indexing side. 1) find "cool". If it's done on the query side it can use multiple cores, etc... | ||||||||||||||||
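The query-side proximity idea could look like the sketch below: the index only answers "which documents contain the word", and the distance check runs afterwards over the fetched text (function name and threshold are illustrative assumptions, not a proposed API).

```javascript
// Query-side proximity check: true if w1 and w2 occur within maxGap
// word positions of each other in the document text.
function withinDistance(text, w1, w2, maxGap) {
  const words = text.toLowerCase().split(/\s+/);
  const p1 = [], p2 = [];
  words.forEach((w, i) => {
    if (w === w1) p1.push(i);
    if (w === w2) p2.push(i);
  });
  return p1.some(i => p2.some(j => Math.abs(i - j) <= maxGap));
}

const near = withinDistance('some really cool new stuff', 'cool', 'stuff', 2);
const far = withinDistance('cool one two three four stuff', 'cool', 'stuff', 2);
// near is true (gap of 2), far is false (gap of 5)
```

Because this runs after the index lookup, it parallelizes naturally across cores or clients, which is the advantage the comment points at.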
| Comment by David Lehmann [ 29/Oct/09 ] | ||||||||||||||||
|
@alan: I guess it would be harder to use Mongo replication/sharding facilities if the queries are not just plain Mongo queries, but I'm not sure about this. The important functionality of the FTS engine is the analysis/transformation of the input data ... could be wrong on this, but querying should be left to Mongo and therefore the "query language" of Mongo should be used. Combined with map/reduce this would be very powerful. The transformation from a high level query language to regular Mongo queries or map/reduce should be in the application layer, or maybe better in the language specific driver. @raj: If the FTS uses Mongo indexes, the penalty is paid while inserting/deleting a doc. The index is up-to-date after the successful insert/delete. It's the same with the standard indexes. Fulltext indexes for collections that have high insert/delete rates are even more counterproductive than regular indexes because of the nature of natural language analysis algorithms. This would get even worse if word distance is part of the feature set. Maybe I'm thinking too complicated, and a simple OR-based prefix-, infix- and postfix-keyword-search with Snowball stemming for the easily stemmable languages would be fine for 99% of Mongo's users. Will ask that in #mongodb if I find time. | ||||||||||||||||
| Comment by Raj Kadam [ 29/Oct/09 ] | ||||||||||||||||
|
These are all good suggestions, but what would be the latency from time to insert to time in ft index? I mean what is the ideal latency? | ||||||||||||||||
| Comment by Alan Wright [ 29/Oct/09 ] | ||||||||||||||||
|
Wouldn't the query follow the Lucene query syntax? (http://lucene.apache.org/java/2_3_2/queryparsersyntax.html) It might also be useful to search specified fields (also described in the Lucene syntax)... db.foo.find( { $ft : "title:cool AND title:stuff" }) db.foo.find( { $ft : "title:cool OR test:stuff" }) At least in the first version? In future versions, indexing could be expanded to perhaps allow specifying language analysers (useful when stemming)... db.foo.ensureIndex( { title : 1 }, { fullTextSearch : true, textAnalyzer : "standard" } ) db.foo.ensureIndex( { title : 1 }, { fullTextSearch : true, textAnalyzer : "snowball", textLanguage : "German" } ) | ||||||||||||||||
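The proposed field-scoped syntax could be translated into plain Mongo query objects by a tiny client-side parser; the sketch below handles only the `field:word AND field:word` subset (the `$ft` shape is the proposal in this thread, not a shipped API, and `parseFt` is a made-up name).

```javascript
// Toy parser: 'title:cool AND title:stuff' -> a Mongo-style query object.
// Only bare AND-joined field:word clauses are handled in this sketch.
function parseFt(input) {
  const clauses = input.split(/\s+AND\s+/).map(part => {
    const [field, word] = part.split(':');
    return { [field]: word };
  });
  return clauses.length === 1 ? clauses[0] : { $and: clauses };
}

const q = parseFt('title:cool AND title:stuff');
const single = parseFt('title:cool');
// q is { $and: [ { title: 'cool' }, { title: 'stuff' } ] }
// single is { title: 'cool' }
```

Keeping the parser client-side matches the suggestion elsewhere in this thread that query languages belong in the drivers rather than the server.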
| Comment by David Lehmann [ 29/Oct/09 ] | ||||||||||||||||
|
sry, mixed up manipulator with modifiers ^^ corrected in my previous comment. i don't see how a simple {$ft: "something keyword like"} could help me in finding docs if i have more than one ftindex. a keyword search could/should be something like the $or query in SERVER-205: {$ft: {title: ["keyword", "search"]}} | ||||||||||||||||
| Comment by Eliot Horowitz (Inactive) [ 29/Oct/09 ] | ||||||||||||||||
|
keyword search | ||||||||||||||||
| Comment by David Lehmann [ 29/Oct/09 ] | ||||||||||||||||
|
@eliot maybe i got it wrong. {$ft: "cool stuff"} means phrase-, key-search or query language? | ||||||||||||||||
| Comment by Eliot Horowitz (Inactive) [ 29/Oct/09 ] | ||||||||||||||||
|
@david Not sure what you mean by that? | ||||||||||||||||
| Comment by David Lehmann [ 29/Oct/09 ] | ||||||||||||||||
|
@eliot yep, that's what i wanted to say. what do you think about the idea of implementing the query parser as a client-side manipulator? | ||||||||||||||||
| Comment by Alan Wright [ 29/Oct/09 ] | ||||||||||||||||
|
Eliot - that looks good. Adding full text searching to MongoDB and making it as easy to use as you've described would be fantastic! | ||||||||||||||||
| Comment by Eliot Horowitz (Inactive) [ 29/Oct/09 ] | ||||||||||||||||
|
I think the right way to do this is the following.
| ||||||||||||||||
| Comment by David Lehmann [ 29/Oct/09 ] | ||||||||||||||||
|
If Mongo gets integrated full text search (FTS), then it should be as light and concise as the existing index functionality. This means that I'm strongly opposed to any kind of schema unless it is as small and simple as the ones used for indexes and unique indexes. If we design it with the regular indexes in mind, we could assume two things: it is only useful on text fields, and the ordering of the index is less important. The "order" parameter in the index creation/ensuring call could be used for a weight value if we want to recycle the regular index creation mechanics. We could use an option "fulltext" as we do right now for "unique". The performance tradeoffs when using FTS should be documented as they already are for indexes and unique indexes. The inclusion of documents in the index should be triggered by an existence and type check for the index fields on the document to write/update. After looking at Sphinx http://www.sphinxsearch.com/docs/current.html, Xapian http://xapian.org/ and CLucene http://sourceforge.net/projects/clucene/ it seems that CLucene is the most flexible. Correct me if I'm wrong, but neither Sphinx nor Xapian supports custom persistency implementations. Even if we could store the index files of those two engines in GridFS, this should not be the way to go. Mongo is an extremely fast database and progresses at light speed when it comes to features for easy replication and usage in cluster architectures. Any other persistency mechanism used for data that could also be stored in Mongo just increases the complexity of the setup and brings new and unnecessary problems for both the developers and the users of Mongo. Using Mongo as the persistency layer would ensure the availability of its features for clustering (sharding and map/reduce). When it comes to the feature set, stemming, keyword search and phrase search are absolutely necessary. The querying mechanics should not include any parsing of fancy search input IMHO but use the Mongo $opcodes. 
Query languages should be a separate concern and can be integrated into the language specific drivers via the manipulator mechanics. A simple query parser could be shipped with the Mongo sources, and driver providers could use it via the language specific foreign function interface or wrapper builders like SWIG http://www.swig.org/. It would be nice if we isolated the parts of FTS that are necessary to build a common infrastructure for FTS integration into Mongo. As mentioned by Alan Wright, CLucene has a clean separation of FTS and persistency and therefore could be a good starting point for our efforts. A common infrastructure would help everybody interested in integrating his or her preferred full text search engine. | ||||||||||||||||
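The existence-and-type-check inclusion rule proposed above could be sketched in a few lines (the function name and field list are illustrative, not a proposed API):

```javascript
// Sketch of the inclusion rule: a document enters the full-text index
// only if every indexed field exists on it and holds a string value.
function shouldIndex(doc, fields) {
  return fields.every(f => typeof doc[f] === 'string');
}

const indexed = shouldIndex({ title: 'water', body: 'drop' }, ['title', 'body']);
const skipped = shouldIndex({ title: 42 }, ['title']);
// indexed is true; skipped is false (title is not a string)
```

This mirrors how sparse behavior works for regular indexes: documents that do not match the shape are simply left out rather than causing errors.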
| Comment by Richard Boulton [ 22/Oct/09 ] | ||||||||||||||||
|
Hi - Xapian developer here, and Mongo DB enthusiast (though I've not had an excuse to play with it in anger yet). I'd like to help make a tight integration between Xapian and MongoDB, if there's interest in it. I'm not quite sure what the best approach for linking would be, though. Xapian certainly supports "realtime" updates in the sense described above. It also has some features in trunk for supporting replication of the index, which might be helpful when working with MongoDB. One basic approach would be to hook into the updates in Mongo somehow, and send them across to a parallel Xapian index for full-text indexing. I think this would be best done by defining some kind of schema, though: often, when searching, you want to search across a fairly complex set of fields (a common example is to search across both title fields and content fields, but to boost the importance of the title fields - but in real world search situations, you often come up with much more complex requirements). A naive mapping of a particular field in Mongo to a search index would allow basic search, but we can do much better than that, I think. There are also things like thesaurus entries and spelling correction, which you would want to be able to configure somehow. Mongo doesn't really have schemas yet, IIRC, so I'm not sure how the Mongo developers would feel about adding that sort of context. When defining searches, Xapian has a built-in and flexible query parser (which is aimed at parsing queries entered into a search box by the average untrained user, so supports some structure (eg, field:value), but copes with any random input in a sane way). It can also have structured searches built up, and combined with the output of parsing several user inputs, so a mapping from a Mongo-style query to a Xapian search could be defined to limit Mongo results. 
Xapian also has things called "External Posting Sources" which are arbitrary C++ classes (subclassing a Xapian::PostingSource class), which can be used to perform combined searches across data stored in the Xapian index, and external data. (A "posting source" in search engine terminology is a list of documents matching a particular word (or term) and is the fundamental piece of data stored in a search engine index.) This could be used to limit searches to documents matching a MongoDB query pretty efficiently, without having to store extra data in Xapian. | ||||||||||||||||
| Comment by Alan Wright [ 22/Oct/09 ] | ||||||||||||||||
|
I would put forward that a better candidate for full-text search would be CLucene (http://clucene.sourceforge.net/) - a C++ port of the popular java Lucene engine. Storing the index inside MongoDB would be a simple case of overriding the lucene::store::Directory class to point to MongoDB instead of the file-system. | ||||||||||||||||
| Comment by Raj Kadam [ 22/Oct/09 ] | ||||||||||||||||
|
I guess what I meant by realtime is the full-text engine needs to allow "incremental updates" to the full-text index. If it has to re-index stuff over and over again like Sphinx, that is really taxing on the CPU and drastically increases latency under high-load environments. | ||||||||||||||||
| Comment by Jeremy Hinegardner [ 22/Oct/09 ] | ||||||||||||||||
|
I brought up Xapian (http://xapian.org) on the mongodb-users list as a possible library for use:
As for 'real-time' I would suspect its as real-time as any other full text search library. I have no opinion on whether this should be used or not, it just sounded like a possible good match with mongodb. |