[SERVER-380] basic text indexing and search Created: 21/Oct/09  Updated: 29/Apr/13  Resolved: 07/Jan/13

Status: Closed
Project: Core Server
Component/s: Indexing, Querying
Affects Version/s: None
Fix Version/s: 2.3.2

Type: New Feature Priority: Major - P3
Reporter: Raj Kadam Assignee: Eliot Horowitz
Resolution: Fixed Votes: 239
Labels: None

Issue Links:
is depended on by SERVER-1635 Faceting support Closed
Backwards Compatibility: Fully Compatible


Simple text indexing.
Initial version will be marked as experimental, so has to be turned on with a command line flag or at runtime.

db.adminCommand( { setParameter : "*", textSearchEnabled : true } );


./mongod --setParameter textSearchEnabled=true

Only works via "text" command currently.


  • parsing + stemming for latin languages
  • multi-word scoring
  • phrase matching
  • word negation
  • weights per field
  • additional suffix index fields for coverings
  • additional prefix index fields for partitioning
  • specify language per document

Simple Example:

> db.foo.insert( { _id: 1 , desc: "the dog is running" } )
> db.foo.insert( { _id: 2 , desc: "the cat is walking" } )
> db.foo.ensureIndex( { "desc": "text" } )
> db.foo.runCommand( "text", { search : "walk" } )
{
	"queryDebugString" : "walk||||||",
	"language" : "english",
	"results" : [
		{
			"score" : 0.75,
			"obj" : {
				"_id" : 2,
				"desc" : "the cat is walking"
			}
		}
	],
	"stats" : {
		"nscanned" : 1,
		"nscannedObjects" : 0,
		"n" : 1,
		"timeMicros" : 330
	},
	"ok" : 1
}

Comment by Jeremy Hinegardner [ 22/Oct/09 ]

I brought up Xapian (http://xapian.org) on the mongodb-users list as a possible library for use:


  • Written in C++, so would probably work well with MongoDB
  • Has keyword searching
  • Phrase searching
  • boolean logic
  • stemming

As for 'real-time', I would suspect it's as real-time as any other full text search library.

I have no opinion on whether this should be used or not, it just sounded like a possible good match with mongodb.

Comment by Raj Kadam [ 22/Oct/09 ]

I guess what I meant by realtime is the full-text engine needs to allow "incremental updates" to the full-text index. If it has to re-index stuff over and over again like Sphinx that is really taxing on the CPU and increases latency drastically under high load environments.

Comment by Alan Wright [ 22/Oct/09 ]

I would put forward that a better candidate for full-text search would be CLucene (http://clucene.sourceforge.net/) - a C++ port of the popular java Lucene engine.

Storing the index inside MongoDB would be a simple case of overriding the lucene::store::Directory class to point to MongoDB instead of the file-system.

Comment by Richard Boulton [ 22/Oct/09 ]

Hi - Xapian developer here, and Mongo DB enthusiast (though I've not had an excuse to play with it in anger yet). I'd like to help make a tight integration between Xapian and MongoDB, if there's interest in it. I'm not quite sure what the best approach for linking would be, though.

Xapian certainly supports "realtime" updates in the sense described above. It also has some features in trunk for supporting replication of the index, which might be helpful when working with MongoDB.

One basic approach would be to hook into the updates in Mongo somehow, and send them across to a parallel Xapian index for full-text indexing. I think this would be best done by defining some kind of schema, though: often, when searching, you want to search across a fairly complex set of fields (a common example is to search across both title fields and content fields, but to boost the importance of the title fields - but in real world search situations, you often come up with much more complex requirements). A naive mapping of a particular field in Mongo to a search index would allow basic search, but we can do much better than that, I think.

There are also things like thesaurus entries and spelling correction, which you would want to be able to configure somehow.

Mongo doesn't really have schemas yet, IIRC, so I'm not sure how the Mongo developers would feel about adding that sort of context.

When defining searches, Xapian has a built-in and flexible query parser (which is aimed at parsing queries entered into a search box by the average untrained user, so supports some structure (eg, field:value), but copes with any random input in a sane way). It can also have structured searches built up, and combined with the output of parsing several user inputs, so a mapping from a Mongo-style query to a Xapian search could be defined to limit Mongo results.

Xapian also has things called "External Posting Sources" which are arbitrary C++ classes (subclassing a Xapian::PostingSource class), which can be used to perform combined searches across data stored in the Xapian index, and external data. (A "posting source" in search engine terminology is a list of documents matching a particular word (or term) and is the fundamental piece of data stored in a search engine index.) This could be used to limit searches to documents matching a MongoDB query pretty efficiently, without having to store extra data in Xapian.

Comment by David Lehmann [ 29/Oct/09 ]

If Mongo gets integrated full text search (FTS), then it should be as light and concise as the existing index functionality. This means that I'm strongly opposed to any kind of schema unless it is as small and simple as the ones used for indexes and unique indexes. If we design it with the regular indexes in mind we could assume two things: it is only useful on text fields and the ordering of the index is less important. The "order" parameter in the index creation/ensuring call could be used for a weight value if we want to recycle the regular index creation mechanics. We could use an option "fulltext" as we do right now for "unique". The performance tradeoffs when using FTS should be documented as they already are for indexes and unique indexes. The inclusion of documents in the index should be triggered by an existence and type check for the index fields on the document to write/update.

After looking at Sphinx http://www.sphinxsearch.com/docs/current.html, Xapian http://xapian.org/ and CLucene http://sourceforge.net/projects/clucene/ it seems that CLucene is the most flexible. Correct me if I'm wrong but neither Sphinx nor Xapian support custom persistency implementations. Even if we could store the index files of those two engines in GridFS this should not be the way to go. Mongo is an extremely fast database and progresses in light speed when it comes to features for easy replication and usage in cluster architectures. Any other persistency mechanism used for data that could also be stored in Mongo just increases complexity of the setup and brings new and unnecessary problems for both, the developers and users of Mongo. Using Mongo as persistency layer would ensure the availability of its features for clustering (sharding and map/reduce).

When it comes to the feature set, stemming, keyword search and phrase search are absolutely necessary. The querying mechanics should not include any parsing of fancy search input IMHO but use the Mongo $opcodes. Query languages should also be a part of its own and can be integrated in the language specific drivers via the manipulator mechanics. A simple query parser could be shipped with the Mongo sources and driver providers could use it via the language specific foreign language interface or wrapper builders like SWIG http://www.swig.org/.

It would be nice if we isolate the parts of FTS that are necessary to build a common infrastructure for FTS integration into Mongo. As mentioned by Alan Wright, CLucene has a clean separation of FTS and persistency and therefore could be a good starting point for our efforts. A common infrastructure would help everybody interested in integrating his or her preferred full text search engine.

Comment by Eliot Horowitz [ 29/Oct/09 ]

I think the right way to do this is the following.

  • make the db take:
    db.foo.ensureIndex( { title : 1 } , { fullTextSearch : true } )
    db.foo.ensureIndex( { title : 1 , test : .2 } , { fullTextSearch : true } )
  • modify the indexing code to use clucene (or whichever) engine for tokenizing and stemming, and put those in the index
  • query would be something like: db.foo.find( { $ft : "cool stuff" } )


Comment by Alan Wright [ 29/Oct/09 ]

Eliot - that looks good.

Adding full text searching to MongoDB and making it as easy to use as you've described would be fantastic!

Comment by David Lehmann [ 29/Oct/09 ]

@eliot yep, that's what i wanted to say

what do you think about the idea of implementing the query parser as a client side manipulator?

Comment by Eliot Horowitz [ 29/Oct/09 ]

@david Not sure what you mean by that?

Comment by David Lehmann [ 29/Oct/09 ]

@eliot maybe i got it wrong.

{$ft: "cool stuff"} means phrase-, key-search or query language?

Comment by Eliot Horowitz [ 29/Oct/09 ]

keyword search
so "cool stuff" would look up matches for cool, and stuff, and then intersection, scoring, etc...
could eventually go into a query language, but not initially probably

Comment by David Lehmann [ 29/Oct/09 ]

sry, mixed up manipulator with modifiers ^^ corrected in my previous comment

i don't see how a simple {$ft: "something keyword like"} could help me in finding docs if i have more than one ftindex.

a keyword search could/should be something like the $or query in SERVER-205

{$ft: {title: ["keyword", "search"]}}

Comment by Alan Wright [ 29/Oct/09 ]

Wouldn't the query follow the Lucene query syntax? (http://lucene.apache.org/java/2_3_2/queryparsersyntax.html)

It might also be useful to search specified fields (also described in the Lucene syntax)...


{ $ft : "title:cool AND title:stuff" }



{ $ft : "title:cool OR test:stuff" }


At least in the first version?

In future versions, indexing could be expanded to perhaps allow specifying language analysers (useful when stemming)...


{ title : 1 } , { fullTextSearch : true } , { textAnalyzer : standard }

{ title : 1 } , { fullTextSearch : true } , { textAnalyzer : snowball } , { textLanguage : German }


Comment by Raj Kadam [ 29/Oct/09 ]

These are all good suggestions, but what would be the latency from time to insert to time in ft index? I mean what is the ideal latency?

Comment by David Lehmann [ 29/Oct/09 ]

@alan: I guess it would be harder to use Mongo replication/sharding facilities if the queries are not just plain Mongo queries but I'm not sure about this. The important functionality of the FTS engine is the analysis/transformation of the input data ... could be wrong on this but querying should be left to Mongo and therefore the "query language" of Mongo should be used. Combined with map/reduce this would be very powerful. The transformation from a high level query language to regular Mongo queries or map/reduce should be in the application layer or maybe better in the language specific driver.

@raj: If the FTS uses Mongo indexes, the penalty is paid while inserting/deleting a doc. The index is up-to-date after the successful insert/delete. It's the same with the standard indexes. Fulltext indexes for collections that have high insert/delete rates are even more counterproductive than regular indexes because of the nature of natural language analysis algorithms. This should get even worse if word distance is part of the feature set.

Maybe I'm thinking too complicated and a simple OR based prefix-, infix- and postfix-keyword-search with snowball stemming for the easily stemmable languages would be fine for 99% of Mongo's users. Will ask that in #mongodb if I find time.

Comment by Eliot Horowitz [ 29/Oct/09 ]

I think distance, etc.. can be done on the query side rather than the indexing side.
a search for "cool stuff"

1) find "cool"
2) find "stuff"
3) find intersection
4) score that subset

if its done on the query side can use multiple cores, etc...
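
The four steps above can be sketched in plain JavaScript. This is only an illustration of the idea (index holds just the words, proximity is worked out at query time), not how the server does or would do it. All names (tokenize, buildIndex, proximityScore, search) are made up for this sketch, the tokenizer is ASCII-only, and the score is a placeholder.

```javascript
// Hypothetical sketch: the index stores only words -> document ids.
// Positions and proximity are recomputed on the query side.
function tokenize(text) {
  return text.toLowerCase().split(/\W+/).filter(Boolean); // ASCII words only
}

function buildIndex(docs) {
  var index = Object.create(null);
  docs.forEach(function (doc) {
    tokenize(doc.text).forEach(function (word) {
      var ids = index[word] || (index[word] = []);
      if (ids.indexOf(doc._id) === -1) ids.push(doc._id);
    });
  });
  return index;
}

// Placeholder score: for each adjacent pair of query words, add
// 1 / (smallest distance between the two words in the document).
function proximityScore(tokens, words) {
  if (words.length === 1) return 1;
  var score = 0;
  for (var i = 0; i + 1 < words.length; i++) {
    var best = Infinity;
    tokens.forEach(function (t, p) {
      if (t !== words[i]) return;
      tokens.forEach(function (u, q) {
        if (u === words[i + 1] && q !== p) best = Math.min(best, Math.abs(q - p));
      });
    });
    score += 1 / best;
  }
  return score;
}

function search(docs, index, query) {
  var words = tokenize(query);
  if (words.length === 0) return [];
  // 1) and 2): look up the id list for every query word
  var lists = words.map(function (w) { return index[w] || []; });
  // 3): intersect the lists
  var ids = lists.reduce(function (acc, list) {
    return acc.filter(function (id) { return list.indexOf(id) !== -1; });
  });
  // 4): score only the surviving subset
  return docs
    .filter(function (doc) { return ids.indexOf(doc._id) !== -1; })
    .map(function (doc) {
      return { _id: doc._id, score: proximityScore(tokenize(doc.text), words) };
    })
    .sort(function (a, b) { return b.score - a.score; });
}
```

Because scoring only touches the documents left after the intersection, the expensive part runs on a small subset, which is the point of doing it on the query side.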

Comment by David Lehmann [ 30/Oct/09 ]

With distance I mean the distance used in proximity searches not weight.

Comment by Eliot Horowitz [ 30/Oct/09 ]

@David - right, that's what i meant as well

Comment by David Lehmann [ 30/Oct/09 ]

@eliot good to know. Which data would you put in the index and how would you do the scoring?

Comment by Eliot Horowitz [ 30/Oct/09 ]

the index would just be the words.
scoring would be after and can be based on proximity, etc...

Comment by Sebastian Friedel [ 30/Oct/09 ]

hello everyone
ok, so if you want to have infix search with a minimal infix length of 2 you'll have an index as follows:
field: ['co', 'oo', 'ol', 'coo', 'ool', 'cool', 'st', 'tu', 'uf', 'ff', 'stu', 'tuf', 'uff', 'stuf', 'tuff', 'stuff']
how would you extrapolate word proximity from that?
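
For reference, a small sketch that produces exactly the list above. The function name infixes is made up; it simply enumerates every substring of at least the minimum length, shortest first.

```javascript
// Hypothetical helper: all distinct substrings of `word` with length >= minLength.
function infixes(word, minLength) {
  var out = [];
  for (var len = minLength; len <= word.length; len++) {
    for (var start = 0; start + len <= word.length; start++) {
      var piece = word.substring(start, start + len);
      if (out.indexOf(piece) === -1) out.push(piece);
    }
  }
  return out;
}

// The index field from the comment: infixes of each word, concatenated.
var field = infixes("cool", 2).concat(infixes("stuff", 2));
```

A word of n distinct-enough characters yields up to n(n-1)/2 entries (6 for "cool", 10 for "stuff"), and the flat list no longer says where in the text each entry came from, which is why proximity cannot be read back out of it.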

Comment by Eliot Horowitz [ 30/Oct/09 ]

i don't think version 1 would have substring matching.
more like a regular search engine.

i think that would be a different mode, since its a lot more costly, and not as often needed

Comment by Sebastian Friedel [ 30/Oct/09 ]

I don't think that it is so seldomly needed. Think of things like 'waterbed' or 'waterdrop' ... as a user I would find it very irritating if I wouldn't find any of those when I search for 'water'.
And there are languages where connected words are much more common than in english.
I could give more examples. but I think the above already states my point.

Comment by Alan Wright [ 30/Oct/09 ]

@Sebastian - I would take a look at the Lucene Wildcards (http://lucene.apache.org/java/2_3_2/queryparsersyntax.html#Wildcard%20Searches). This would provide the capability you need (eg. searching for "water*" would find "waterdrop" and "waterbed")

Comment by Eliot Horowitz [ 30/Oct/09 ]

A lot of it comes down to covering all the basics while keeping it simple and fast.
Since this will be real time - we need to be very speed aware and really just make sure it covers most use cases.
While the "water" example is good, I still think it's less common and adds almost an order of magnitude more complexity.
not saying it can't happen, just saying probably not for version 1.

also, while we work on basic full text search inside Mongo, it still might make sense for more adapters to an outside, non-realtime search engine that could be more complex.

Comment by Sebastian Friedel [ 30/Oct/09 ]

@Alan: having integrated search engines several times in the last years I know that every serious search implements at least prefix or infix search alongside with stemming
@Eliot: ok, so you would like the fts in mongodb to be just basic and simple and let the more 'fancy' use cases be handled by specialized software such as sphinx/lucene/xapian?

Comment by Eliot Horowitz [ 30/Oct/09 ]


It could even be clucene with a mongo backend.
The key is that what's "builtin" needs to be real time and very efficient.
So we want to do the minimal work that solves a lot of real world problems.

Comment by David Lehmann [ 30/Oct/09 ]

@eliot: That's why I asked what we can put into the indexes and what could be done in the client implementations/manipulators

I'm sure that it is possible to have real full text search in Mongo with a query language that is a superset (doh, a connected word again ) of standard queries. We have to find out which parts have to stay in the clients or application layer and what we wanna do in the write/delete ops of the DB. Main "problem" are Mongo indexes - but just if we wanna use them for storage of the entire search indexes. This gets clear if we take a closer look at proximity search but it looks not like a long time blocker to me.

A first step would be to do exactly what you propose when you talk about Lucene with a Mongo backend - could be any engine with a clean separation of NLA and persistency. The simple search you described earlier would be the "ASCII of search". This would not deliver the wanted results in languages like French, German ... you name it we have it ) IMHO everyone who wants to have non english FTS in Mongo just would not use the simple one because it wouldn't cover their needs.

Comment by David Lehmann [ 30/Oct/09 ]

I forgot to say that I'm not talking about what 10gen should do as a first step. Sebastian and I have plans to implement FTS anyway and just want to make sure that we don't just cover our needs.

Comment by Nicolas Fouché [ 08/Nov/09 ]

I've developed my own search feature on MongoDB, with $ft keys. If you store documents with all languages, stemming is too much pain. (language detection, then apply the good stemmer if one can be applied - and sometimes language detection per paragraph/sentence is needed).

Like Eliot says, if MongoDB embeds a full-text search feature, it should be as minimal as possible. Extract words, convert them to ASCII, remove 1 or 2 characters words, and puts them in an Array key. The more MongoDB does, the more choices they make, and the less use cases they'll match.
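
A minimal sketch of that extraction step, assuming a JavaScript engine with String.prototype.normalize (ES2015; older mongo shells will not have it). The function name extractWords and the "ft" key below are invented for illustration. Characters that do not decompose to ASCII (for example "ß" or "ø") are simply treated as separators here, which is a limitation of the sketch.

```javascript
// Hypothetical helper: extract words, fold accents to ASCII,
// drop words of 1 or 2 characters, and return a de-duplicated array.
function extractWords(text) {
  var ascii = text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
  var seen = Object.create(null);
  return ascii.toLowerCase().split(/[^a-z0-9]+/).filter(function (w) {
    if (w.length <= 2 || seen[w]) return false;
    seen[w] = true;
    return true;
  });
}

// The result would be stored next to the text in an array key, e.g.
//   { body: "...", ft: extractWords(body) }
// and could then be matched with an ordinary query such as
//   db.posts.find( { ft : { $all : [ "cafe", "gare" ] } } )
// (collection and key names are assumptions, not from this ticket).
```

The query side has to run its terms through the same function, otherwise "café" in a query would never match the stored "cafe".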

Stemming can be (kind of) replaced by expanding the query http://en.wikipedia.org/wiki/Query_expansion . For that I still wait for the $or feature. Query expansion adds more processing at runtime, but makes the database a lot more flexible. No need to migrate data each time you enhance your algorithms for example.

You build above that a query grammar (with Treetop for example http://treetop.rubyforge.org/), create multiple keys for full text and you have a search engine that supports metadata queries. e.g.: "waterdrop subject:house date:>20090101" And as soon as the $or feature is ready, the user could add OR keywords in their queries (to satisfy one or two of them).

If anyone is interested, I can write blog articles to describe this solution in depth.

Of course you don't have wildcards, phrase search, fuzzy search, nearby search or scoring. But I suppose that if you need this, then you definitely don't target average users. Take a look at the search feature in Facebook messages powered by Cassandra, it's horrible (non-English speakers would understand), it does not even find the same exact words you type in your query... but it's blazing fast and no-one complains. It seems that Twitter added phrase search recently, Digg did not, and neither did Yammer. As a former Lucene user, I thought that I needed all these features, but I discovered that none of our users asked for them, and I do not actually need them to find what I'm looking for. In startups we don't want to spend 80% of the time to satisfy 1% of our users.

For developers needing a featureful search, then why not considering an "external hook" à la CouchDB ?

Comment by Eliot Horowitz [ 19/Nov/09 ]

No licensing issues I believe - certainly not in building yourself.

The external hook is very easy to get started with. Very simple to tap into mongo's replication log and use that for updating indexes.
I'd love to work with anyone on that.

Embedding is more complicated. We just built a module system that allows external modules to be linked in.
We're going to be adding more hooks into that, so it would be possible to have a c++ trigger basically.
That would be a good 2nd level integration.

the 3rd level would be having it totally built in. that's a ways off, and i'm not sure if it's something that is needed anyway

Comment by Eliot Horowitz [ 24/Nov/09 ]

Most of what you need should be here: http://www.mongodb.org/display/DOCS/Replication+Internals
the transaction log is a capped collection, so it's just a matter of reading from that.

Comment by gf [ 03/Dec/09 ]

Sphinx rules. The "incremental updates" feature is coming soon.
Full re-indexing is possible using "xmlpipe2" data-source.
Live updates will be possible using replication (like MongoNode).

Comment by Raj Kadam [ 04/Dec/09 ]

If Sphinx does incremental updates, then yes, I believe it is at the top of the pack.

Comment by Eliot Horowitz [ 02/Apr/10 ]

trying to keep 1.5/1.6 very focused on sharding + replica sets.
will try to get it out asap so we can go back to features like these

Comment by David James [ 05/Apr/10 ]

@Nicolas Fouché: just saw your comment about using treetop to create a query language, so I wanted to share a little code I put together: http://github.com/djsun/query_string_filter which converts a 'filter' param in a query string to a hash suitable for MongoMapper.

Comment by Roger Binns [ 17/Jul/10 ]

Another requirement for me not mentioned here is providing information for autocompleting fields in a web interface.

Comment by gf [ 26/Jul/10 ]

Sphinx 1.10-beta is out including real-time indexes.

Comment by dan [ 26/Jul/10 ]

My company has written a mongo-native full-text search. Currently supports English - although stemmer commits are welcome. There is also a python library, which has substantial extra functionality because of restrictions on server-side javascript execution. Indexing happens via mapreduce for maximal concurrency. v8 build recommended for speed - our trials report about 4x speed increase


Comment by Andrew G [ 05/Aug/10 ]

In case anyone is interested, I have written a prototype desktop application for which I need a database with text index/ search facility (with incremental updates). It runs on kubuntu if deb package python-storm is installed. It should run under Windows if you install enough things (Python 2.6, Qt, PyQt, Canonical Storm).


article here


Comment by Rob [ 06/Aug/10 ]

I'd also like to vote for these:

  • Search by keywords
  • Real-time
  • Allow for phrase searching

I'm all for keeping MongoDB simple as others have stated. I agree how stemming/wildcards could be deferred to advanced cases.

But if you leave out phrase searching, there's no advantage over just breaking words into an array, is there?

Comment by dan [ 07/Aug/10 ]

you can add phrasal searching to the library we have produced using simple javascript. patches welcome. we've been keen to keep the library fast and simple so far, and haven't added phrasal search as such because a well-ranked stemming search has done the job for us very well.

As for real-time-ness... i guess that would require c++ level support, unless you were keen to implement the indexing function in a client library. mapreduce is currently the only option for non-blocking server-side JS execution.

What do you mean by "keyword search" specifically?

Comment by Michael Stephens [ 09/Aug/10 ]

I've started work on a little tool (http://github.com/mikejs/photovoltaic) that uses mongo's replication internals to automatically keep a solr index up to date. It's rough around the edges but my FTS needs aren't very fancy.

Comment by Eric Mill [ 25/Aug/10 ]

Is this still targeted for release in 1.7? I'm gauging whether I should hold out, or go try to integrate with Solr.

Comment by Eliot Horowitz [ 25/Aug/10 ]

Its being considered for 1.7.x
Not committed yet - though we'd like to

Comment by dan [ 26/Aug/10 ]

@eric - So you didn't like our native Mongo search? Is phrasal searching the only show stopper?
@eliot - I'm hoping you folks will use the work we did on our FTS for mongo itself. It has test suites and such. Of course, if you are going to roll a new one in raw C++ that will change things.

Comment by Eliot Horowitz [ 26/Aug/10 ]

@dan - whatever we do will definitely be embedded in the db written in c++

Comment by Eric Mill [ 06/Sep/10 ]

@dan - I'd consider using your library if there were documentation - I have little experience with MongoDB plugins and I'm not sure how to use your code. Phrasal search is important, but I would be fine using a solution that lacked it as a stopgap.

@eliot - I'll be crossing my fingers, then.

Comment by huangzhijian [ 09/Sep/10 ]

@Eliot Horowitz I am really looking forward to the full-text search functionality. It would be fantastic if it could support the Chinese language.

Comment by Eliot Horowitz [ 15/Sep/10 ]

Does anyone watching this case have a dataset and tests with another system (mysql,xapian,etc...)
Would be very helpful.

Comment by Johan Bergström [ 15/Sep/10 ]

Eliot: I have a system using postgres to mongo (transition period) and xapian for full text search. What kind of input do you seek? (sorry, this isn't oss)

Comment by Eliot Horowitz [ 15/Sep/10 ]

Ideally a data set and some test cases. Don't need any code.

Comment by Walt Woods [ 07/Oct/10 ]

@Eliot Horowitz - Question about your above-mentioned ensureIndex( { title: 1, body: 0.2 }, { fullTextSearch: true } ) API example; are these per-word weights or per-document weights?

Comment by Eliot Horowitz [ 07/Oct/10 ]

@walt - idea is per word, if I understand what you mean

Comment by Walt Woods [ 08/Oct/10 ]

@Eliot Horowitz - Ah... Do you think it would be a possibility to provide per-document weights? Titles are almost always short, but bodies are much more variable; in my opinion, it's not very fair to count a longer document as more pertinent to the subject requested.

Comment by Eliot Horowitz [ 08/Oct/10 ]

Can you describe exactly what you mean by per word and per document

Comment by Walt Woods [ 08/Oct/10 ]

Yeah; per-word weighting: Count occurrences of word. Multiply occurrences by weight for that word's total weight in document. Store in index.

Per-document weighting: Count occurrences of a single word, divide by total occurrence count for the indexed field (e.g. # of words). Multiply this fraction by the specified index weight. Store in index.

Essentially, using weights to bound the effects of any single field in the full text index with respect to the total document score. Maybe this is obvious and how it was going to be done anyway? Admittedly, I don't have much experience with other full-text indexers... just what I've done in experimenting with my own variation for a while now.
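
To make the two definitions above concrete, here is a small sketch. The function names are invented, the inputs are already-tokenized fields, and 0.2 is the body weight from the ensureIndex example; none of this reflects an actual server implementation.

```javascript
// Hypothetical helpers for the two weighting schemes described above.
function occurrences(tokens, word) {
  return tokens.filter(function (t) { return t === word; }).length;
}

// Per-word: occurrences * field weight. Grows with document length.
function perWordWeight(tokens, word, weight) {
  return occurrences(tokens, word) * weight;
}

// Per-document: (occurrences / total words in the field) * field weight.
// Bounded by `weight`, however long the field is.
function perDocumentWeight(tokens, word, weight) {
  if (tokens.length === 0) return 0;
  return (occurrences(tokens, word) / tokens.length) * weight;
}

var shortBody = ["mongo", "search"];
var longBody = ["mongo", "is", "a", "db", "mongo", "has", "an", "index", "mongo", "rocks"];

// Per-word favours the long body (about 0.6 vs 0.2);
// per-document favours the short one (about 0.06 vs 0.1).
var perWord = [perWordWeight(shortBody, "mongo", 0.2), perWordWeight(longBody, "mongo", 0.2)];
var perDoc = [perDocumentWeight(shortBody, "mongo", 0.2), perDocumentWeight(longBody, "mongo", 0.2)];
```

With the per-document variant a field can never contribute more than its configured weight to the total score, which is the bounding effect described in the comment.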

Comment by Roger Binns [ 08/Oct/10 ]

@Walt: There are already scoring algorithms developed and tuned over the years. For example see BM25:


BM25F would take into account multiple fields for a document (eg a title in addition to a body).

There is an open source pure Python text search library Whoosh that implements this scoring algorithm hence providing some nice reference code. I believe it is also part of Xapian etc.


Comment by Eliot Horowitz [ 12/Nov/10 ]

We have a proof of concept working in the lab - but want to make sure its rock solid before releasing

Comment by David Lee [ 12/Nov/10 ]

That's great! Do you plan on supporting stemming and phrase matching?

Comment by Eliot Horowitz [ 12/Nov/10 ]

Stemming for sure - phrase may or not be in version 1.

Comment by gf [ 12/Nov/10 ]

Eliot Horowitz: why don't you want to use Sphinx? Implementation is easy for sure. That's feature-rich and robust.

Comment by Mitch Pirtle [ 12/Nov/10 ]

Sphinx doesn't have partial updates to indexes IIRC. You'd have to regenerate all indexes from scratch for every update.

I've been looking into Elastic Search as a workaround for now, but other priorities keep me from getting it done.

Comment by gf [ 12/Nov/10 ]

Mitch Pirtle: Check the sphinx site. "Jul 19, 2010. Sphinx 1.10-beta is out: We're happy to release Sphinx 1.10-beta, with a new major version number that means, as promised, real-time indexes support."

Comment by David Lee [ 12/Nov/10 ]

gf: The Sphinx real-time indexing is only available for SphinxQL, which is not a good option for MongoDB.

Comment by Matt L [ 12/Nov/10 ]

..and the Sphinx 1.10 (with RT) is only available as a beta (so far), and has a lot of bugs (see their bug tracker/forum). Tried to use it on my project, but gave up.

Comment by Eliot Horowitz [ 12/Nov/10 ]

A good sphinx adapter would be good as well, but we want something embedded with no external dependencies so for most cases you don't need any other systems.

Comment by Rick Sandhu [ 05/Dec/10 ]

+1 for embedded full text search functionality

Comment by Rick Sandhu [ 06/Dec/10 ]

@Eliot are you still looking for sample datasets with test cases? how large of a dataset would u like?

Comment by Andrew Armstrong [ 15/Dec/10 ]

Unfortunately it appears as though wikipedia's data dumps are offline at the moment due to server trouble (http://download.wikimedia.org/) but its about 30GB uncompressed to grab all of the English wiki pages (no history etc) which would probably be a neat data set to test against.

See http://en.wikipedia.org/wiki/Wikipedia:Database_download for more info, perhaps an older copy is available as a torrent/mirrored somewhere you could use.

Comment by Roger Binns [ 15/Dec/10 ]

After doing some work using MongoDB stored content and using Solr (server wrapper over Lucene) as the FTS engine, these are parts of my experience that mattered the most:

You need a list of tokenizers that run over the source data and queries. eg you need to look at quoting, embedded punctuation, capitalization etc to handle stuff like: can't. I.B.M, big:deal, will.i.am, i.cant.believe.this, up/down, PowerShot etc?

A list of filters that work on the tokens - some would replace tokens (eg you replace tokens with upper case to be entirely lower case) while others add (eg you add double metaphone representations, stemmed). Another example is a filter that replaces "4" with "(4 OR four)". If you can apply a type to the tokens then that is great so smith becomes "(smith OR DM:SMT)" and running becomes "(running OR STEM:run)". This lets you match double metaphone against double metaphone, stems against stems without "polluting" the original words.
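
A rough sketch of such a filter chain, using only the examples from the paragraph above. Everything here is illustrative: the filter names are invented, the number table contains just "4", and toyStem is a deliberately naive stand-in for a real stemmer (it only strips "ing" and a doubled final consonant). Double metaphone is left out.

```javascript
// Each filter takes the list of alternatives for one token position
// and returns a new list; filters either replace or add alternatives.
var NUMBER_WORDS = { "4": "four" }; // only the example from the comment

function lowercase(alts) { // replacing filter
  return alts.map(function (t) { return t.toLowerCase(); });
}

function spellNumbers(alts) { // adding filter
  var extra = alts
    .filter(function (t) { return NUMBER_WORDS.hasOwnProperty(t); })
    .map(function (t) { return NUMBER_WORDS[t]; });
  return alts.concat(extra);
}

function toyStem(alts) { // adding filter, typed with a STEM: prefix
  var extra = [];
  alts.forEach(function (t) {
    if (t.length > 5 && /ing$/.test(t)) {
      var stem = t.slice(0, -3);
      var n = stem.length;
      if (n > 1 && stem[n - 1] === stem[n - 2]) stem = stem.slice(0, -1);
      extra.push("STEM:" + stem);
    }
  });
  return alts.concat(extra);
}

var FILTERS = [lowercase, spellNumbers, toyStem];

function analyze(text) {
  return text.split(/\s+/).filter(Boolean).map(function (raw) {
    return FILTERS.reduce(function (alts, f) { return f(alts); }, [raw]);
  });
}

// Render one OR-group per token position.
function toQuery(positions) {
  return positions.map(function (alts) {
    return alts.length > 1 ? "(" + alts.join(" OR ") + ")" : alts[0];
  }).join(" ");
}
```

The type prefix keeps derived forms in their own namespace, so a stem only ever matches another stem and the original words stay untouched.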

Some way of boosting certain results. For example if using a music corpus then searches for "michael" should have higher matches for "Michael Jackson" than "Michael Johnson". Note that this is not the same as sorting the results since other tokens in the query also affect scoring. eg "Michael Rooty" needs to pick Michael Johnson's "Rooty Toot Toot for the Moon" and not any MJ song despite all MJ songs having a higher boost.

Multi-valued fields need to be handled correctly. Solr/Lucene treat them as concatenated together. For example a field name: [ "Elvis Presley", "The King"] is treated as though it was "Elvis Presley The King". If your query is for "King" then they treat that as matching one of the four tokens whereas it matches one out of the three tokens of "King of Siam". The BM25 style weighting scheme (or something substantially similar) everyone uses takes the document length into account which causes problems with the concatenation. Of course there will be some documents where the multi-values should be concatenated and others where they are alternatives to each other as in my example.

You need pagination of results which usually means an implementation that caches previous queries to quickly return later parts.

Query debugging is important because you'll end up with unanticipated matches/scores and want to know why. In Solr you add a query parameter (debugQuery=true) and it returns that information. You can see what the tokenization and filtering did. You can then also see for each result how the final score was computed (a tree structure).

Comment by Eliot Horowitz [ 04/Feb/11 ]

We have some more ideas here.
Seems likely to be in 2.2

Comment by Tim Hawkins [ 04/Feb/11 ]

Check out ElasticSearch, Lucene Based, but fully utf-8, REST and JSON based. In most cases you can just pull a record from mongo and push it straight to ES just removing the "_id" field. ES then generates its own _id field (actually called "_id" too) which can be the contents of your original mongo record id as a string (you just need to put the MongoId as the identity on the PUT call to add the record to the database) Supports embedded documents, uses same dot notation for specifying embedded doc members.

Supports incremental updates, sharded indices, faceted search.

Working with ES "feels" just like working with mongo, similar interfaces, similar simplicity.


I'm working on an oplog observer that will allow ES to track changes to Mongodb collections

Comment by Gerhard Balthasar [ 12/Apr/11 ]

Just stumbled over ElasticSearch, seems like the perfect add-in for mongodb, or what about merging? ElasticMongo sounds good to my ears..

Comment by Bryan C Green [ 30/May/11 ]

I am also interested in the solution involving ElasticSearch. I've converted from MySQL to PostgreSQL to ElasticSearch for fulltext search. I think I'm going to use ES + mongodb + PostgreSQL (for now.) I'd like some more elasticsearch rivers...

Comment by Kaspar Fischer [ 26/Jun/11 ]

I am also interested in this. If I understand correctly, a (non-application-level) integration of ElasticSearch (or Solr, or similar technology) would be much easier if triggers were available (https://jira.mongodb.org/browse/SERVER-124).

Comment by Adam Walczak [ 05/Jul/11 ]

Synonym and most-used-word suggestions would also be nice as part of this feature.

Comment by trbs [ 06/Jul/11 ]

I personally don't see how any non-C/C++ system, especially not java/jvm, could be used as an embedded or internal searching system for MongoDB.

It seems to me that is not worth the time talking about in this context. For these kinds of integration developing a good external framework would be
the way forward.

For example, I'm using Xapian (a project which I do consider possibly embeddable by MongoDB) in several of my MongoDB projects, and that works perfectly
fine as a coupled system. When doing a highly specialized search, or when MongoDB is just one of the sources, you will often wind up with an externalized search
engine anyway.

For an integrated search, I agree that something good, simple and fitting 90% of the common use cases would be much better than specialized search.

There are search features particularly interesting for MongoDB, like search queries across multiple collections, how to handle embedded searching in a sharded
environment, using replica sets to scale searching, handling embedded documents and/or deep document structures, etc.

Comment by rgpublic [ 07/Jul/11 ]

OK. Don't want to add too much spam, but since everyone seems to be writing about their ideas on this top-voted issue, here goes:

For our company, what's really great about MongoDB is its innovation, which goes along the lines of: create a database that finally does what IMHO it should have done all the years before, i.e. don't add lots of additional work for a developer (fighting with SQL statements, data types, etc.), but instead make it as easy as possible to 1) get data in and 2) get data out (by searching for it). Consequently, for a full text solution to be "Mongo-like", I guess it should most importantly be seamless! The user shouldn't notice there is an external engine, either during installation or during daily work. There shouldn't be any difference in use between a normal index and a fulltext index. You should be able to create and query them just like any other index. I don't think the solutions proposed here that aim to couple projects like ElasticSearch (especially not projects in a different programming language like Java) would ever be able to meet that criterion properly. It would be even worse if they were kept in sync via triggers or the like. I might be wrong, but I anticipate lots of problems if MongoDB fulltext search worked like this, such as the index being out of sync. Rather, I would prefer it if the fulltext index simply worked as described here (more features added later step by step):


Only the specific details (like having to create an additional field with the words-array) should be "hidden" from the user. If one could create functional indexes (different issue) then at least one would be able to create such an index on an array of words easier without those ugly auxiliary fields.
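
For readers unfamiliar with the workaround being referred to, a minimal sketch of the words-array approach: the application derives an auxiliary array of words from a text field and stores it alongside the document, so an ordinary multikey index can cover it. The field name `_words` and both helper functions are hypothetical, and the tokenizer is ASCII-only with no stemming.

```javascript
// Derive the auxiliary word array that a multikey index could cover.
// "_words" is a hypothetical field name used only for this illustration.
function extractWords(text) {
  // ASCII-only tokenization; no stemming or stop word removal.
  const words = text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  return Array.from(new Set(words)); // de-duplicate, keep first-seen order
}

function withWords(doc, field) {
  return { ...doc, _words: extractWords(doc[field]) };
}

const stored = withWords({ _id: 1, desc: "The dog is running, the dog barks" }, "desc");
// In the shell, an index such as db.foo.ensureIndex({ _words: 1 }) could then
// serve a query such as db.foo.find({ _words: "dog" }).
```

The point of the comment above is that this derived field and its upkeep are exactly what a built-in fulltext index would hide from the user.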

Comment by Kaspar Fischer [ 07/Jul/11 ]

From the issue comments given so far, I read that lots of people want a simple, out-of-the-box full-text search solution. On the other hand, others want more advanced, external search solutions to be integrated. These seem to be two different concerns.

If I understand correctly then both concerns need support by MongoDB (which is not yet available, or underway): namely that a (simple or external) search solution needs to be able to learn when to index what and when to update the index. Maybe such a layer could be added and afterwards people can come up with different "plugins" that realize out-of-the-box search/integration with external search solutions? I would be happy to work on the latter but to do so, I would very much welcome an API where my search solutions can plug in (otherwise I will have to be a MongoDB internals expert).

Comment by Felix Gao [ 13/Aug/11 ]

It has been over 2 years now; what is the current status of this ticket?

Comment by Rick Sandhu [ 12/Nov/11 ]

is this ticket dead? been over 2 years with no resolutions...admins pls update! thx

Comment by Eliot Horowitz [ 16/Nov/11 ]

Not dead - just hasn't been implemented yet.

Comment by David Lee [ 16/Nov/11 ]

Is there a design waiting to be implemented? Or is there no accepted design proposal yet?

It looks like this feature request is in Planning Bucket A. What does the timeline look like for Planning Bucket A?

Comment by Eliot Horowitz [ 16/Nov/11 ]

There is a likely design.

There are just more pressing things to work on at the moment.

We're hoping this makes it in 2012.

Comment by Rouben Meschian [ 12/Dec/11 ]

Having the ability to perform full text search across the mongodb collections would be amazing.
This is a major feature that is missing from this system.

Comment by Sougata Pal [ 12/Jan/12 ]

MongoLantern can also be a good option for fulltext searching directly with mongodb without installing any new software, though it has a few optimization issues with very large databases. Hopefully they will be fixed in later versions.


Comment by Glenn Maynard [ 04/Apr/12 ]

-1 to integrating an actual FTS engine into Mongo. Mongo should provide generic building blocks for implementing features. Complex, domain-specific features like FTS should be layered on top of it.

Features like plugins that can receive notification of changes so they can update indexes sound more reasonable.

Comment by Tuner [ 04/Apr/12 ]

@Glenn Maynard, I don't agree. What about map/reduce in queries that use fulltext at the same time?

Comment by rgpublic [ 04/Apr/12 ]

I don't agree either. Not only map/reduce but basically any search using a combination of an ordinary query and fulltext search would be impossible. In fact, you can already install say ElasticSearch and keep that in sync with MongoDB manually. The problem comes up if you want to run a combined query. You need to learn and use two completely different query languages, download all results (possibly a lot) to the client and recombine them there.
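
A small sketch of the client-side recombination step being criticized here, with plain arrays standing in for the results of the two systems. The function name and the sample ids are made up for illustration.

```javascript
// Combine ids matched by a full-text engine with ids matched by a database
// query. Both inputs are plain arrays standing in for real result sets.
function intersectIds(textHitIds, queryHitIds) {
  const allowed = new Set(queryHitIds.map(String));
  // Filtering the text hits keeps their relevance order.
  return textHitIds.filter((id) => allowed.has(String(id)));
}

const textHits = ["a3", "a1", "a7", "a2"]; // e.g. ranked by the search engine
const queryHits = ["a1", "a2", "a9"];      // e.g. matched by an ordinary query
const combined = intersectIds(textHits, queryHits);
```

Both id lists have to be downloaded in full before the intersection can be computed, which is the cost described above.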

BTW: If SERVER-153 ever gets implemented before this, we could create an index on 'field.split(" ")' and query that to get a simple fulltext search. Just a crazy thought...

Comment by Tuner [ 05/Apr/12 ]

@rgpublic, I'm actually using this solution and it's a PAIN to have a translator of queries for two databases, MongoDB and ElasticSearch. The synchronization is done manually, and ElasticSearch needs MORE fields, for example to sort on RAW data, because ES is just a bunch of indexes by default which can't even be sorted by original value. So it makes the query translator EVEN MORE complex. Please guys, don't give me examples of how easy it is to make a bridge between MongoDB and another full-text engine. It's not. It's not if you are doing more serious queries than "find me people that have foo bar in their description". So please don't block this idea. I believe, and it's practically true, that there is no serious database system without simple full-text capabilities. MongoDB started out to be the "best of both worlds" and it should grow its functionality to match that of already existing and proven solutions. It's not an accident that other databases have full-text. Really, it's not.

Comment by Pawel Krakowiak [ 05/Apr/12 ]

I also use ElasticSearch alongside MongoDB. I had to install & configure it just to be able to do some keyword based searches (the workaround with arrays in MongoDB was not good enough due to text size in the database). I had to write additional code in my web app just to make the full text search work. I have to constantly synchronize the documents between Mongo and ES and bend over backwards to make the queries work (there are geospatial queries involved, ES & Mongo are not 100% on the same page). All I need is to be able to do full text queries on a couple fields. This was a serious letdown for me. Now I have to maintain & support two services just to get some queries to work. I'm dying to see this feature implemented.

Comment by Ben McCann [ 16/Apr/12 ]

Elastic search would be FAR easier to use with MongoDB if there were an elasticsearch river for MongoDB. There'd need to be a trigger or post-commit hook (https://jira.mongodb.org/browse/SERVER-124) for that to happen though.

Comment by Tegan Clark [ 04/Jul/12 ]

Any update on when full-text search is going to start appearing in nightly builds? Is this still realistic for 2012 as stated in the Nov 16 2011 comment? Thanks!

Comment by Rick Sandhu [ 10/Jul/12 ]

where do we stand on this? no response from development team is aggravating to say the least.

Comment by Eliot Horowitz [ 10/Jul/12 ]

There is still no firm date for when this is going to be done, but it's definitely on the short list of new features.

Comment by Artem [ 16/Oct/12 ]

Do you plan to deliver this feature in 2012?

Comment by Tegan Clark [ 09/Nov/12 ]

Any update on when work will start on this? I'm desperate for it!

Comment by Tegan Clark [ 17/Nov/12 ]

In my mind this is heading into the very definition of Vapor-Ware; something that's been promised from Oct 2009 and still hasn't even hit nightly builds in Nov 2012! It all feels a little like wordage to convince you to not head towards couch or riak.

Now I've got to go figure out how to try and bridge Mongo and elasticsearch! It's going to get ugly!

Comment by Randolph Tan [ 17/Nov/12 ]

Hi Tegan,

While the feature is not yet ready and if you plan to use elastic search, I encourage you to check out the mongo-connector project:


Comment by Georges Polyzois [ 17/Nov/12 ]

We use elasticsearch and mongo and it works great so far. Queries are really fast even with 100 million documents. The feature set and ease of use are really great - and

There is some pain of course in keeping data in sync between the two. One might even consider just using ES depending on use case

Good luck

Comment by Tegan Clark [ 18/Nov/12 ]

@Randolph Tan; firstly, thank you very much for a pointer to a very valid option I didn't know existed. Again Thanks, and I mean that.

It somehow just smells wrong though that you guys are engineering integration points to services that this feature would presumably make redundant?

Or am I just way off base?

Comment by Vilhelm K. Vardøy [ 18/Nov/12 ]

IMHO an integration with a proper search engine is a far better approach than implementing a feature like this. If full-text gets implemented in MongoDB, I very much doubt it will ever be as flexible and feature-rich as something that's specialized for the job.

Even though, for example, MySQL and other databases have full-text, people have chosen to use other tools for search for ages, just because of this. I would probably choose something like ElasticSearch again if mongodb were to support basic full-text search, because of the extended functionality ES gives me. If my full-text search needs weren't big, maybe a sentence here and there, I would probably just do the multikeys thing though.

Roughly: It's about choosing the right tool for the job, where each tool is specialized for each task.

Comment by auto [ 25/Dec/12 ]


Author: Eliot Horowitz <eliot@10gen.com>, 2012-12-25T17:07:05Z

Message: SERVER-380: Add snowball stemmer
Branch: master

Comment by auto [ 25/Dec/12 ]


Author: Eliot Horowitz <eliot@10gen.com>, 2012-12-25T17:08:28Z

Message: SERVER-380: Experimental text search indexing
Branch: master

Comment by auto [ 25/Dec/12 ]


Author: Eliot Horowitz <eliot@10gen.com>, 2012-12-25T17:40:37Z

Message: SERVER-380: When testing via mongos, have to enable on all shards
Branch: master

Comment by auto [ 25/Dec/12 ]


Author: Eliot Horowitz <eliot@10gen.com>, 2012-12-25T17:42:47Z

Message: SERVER-380: use unique colletions for each test
Branch: master

Comment by auto [ 26/Dec/12 ]


Author: Tad Marshall <tad@10gen.com>, 2012-12-26T01:00:42Z

Message: SERVER-380 Fix access violation on Windows

Make copy of StringData passed to Tokenizer in case the original
was in a temporary.
Branch: master

Comment by Ulf Schneider [ 10/Jan/13 ]

first of all: many thanks to you for implementing this feature. now, as i can see in what direction you are working, i would like to ask the following question:
i'm storing file attachments of various formats (pdf, ms office, images and so on) via GridFS and i can parse those attachments with apache tika for indexable text content. i get this indexable text content as a set of strings. as you can imagine, the resulting sets could be large (up to megabytes). is it a suitable use case to store those parsing results inside a field that will be text-indexed the way that you have implemented now?

Comment by Mitch Pirtle [ 10/Jan/13 ]

Wanting to log a use case that I suspect will be a bigger issue moving forward - we created a lithium plugin for is_translatable behavior, where a single document can have embedded properties for each language supported by a given app. This is also starting to crop up in the rails world, and I suspect it will grow in other php frameworks as well.

One of the major draws to MongoDB is the document model, and forcing the language at the top level is counter to that approach (at least it is for some of us).

Would it be too difficult to be able to index nested properties by language, instead of forcing the language for the entire document?

For example:

{
  name : "Mitch Pirtle",
  location : [
    {
      "language" : "English",
      "country" : "Italy",
      "city" : "Turin",
      "profile" : "I write code and play bass guitar."
    },
    {
      "language" : "Italian",
      "country" : "Italia",
      "city" : "Torino",
      "profile" : "Scrivo il codice e il basso suonare la chitarra."
    }
  ]
}

Can I index location.profile based on the specified language? Just an example, but hopefully gets my request across clearly.
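
To illustrate the request, here is a minimal sketch of what indexing by nested language could mean: each embedded entry is tokenized and filtered using the stop word list for its own "language" value. The tiny stop word lists and the function names are invented for illustration, there is no stemming, and this is not the server's behavior.

```javascript
// Tiny illustrative stop word lists; real lists are much longer.
const STOP_WORDS = {
  english: new Set(["i", "and", "the"]),
  italian: new Set(["il", "e", "la"]),
};

// Tokenize one embedded entry using the stop words for its own language.
function termsForEntry(entry, field) {
  const stop = STOP_WORDS[entry.language.toLowerCase()] || new Set();
  return entry[field]
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter((word) => word && !stop.has(word));
}

// Collect index terms per language across all entries of an array field.
function termsByLanguage(doc, arrayField, textField) {
  const result = {};
  for (const entry of doc[arrayField]) {
    const language = entry.language.toLowerCase();
    result[language] = (result[language] || []).concat(termsForEntry(entry, textField));
  }
  return result;
}

const person = {
  name: "Mitch Pirtle",
  location: [
    { language: "English", city: "Turin", profile: "I write code and play bass guitar." },
    { language: "Italian", city: "Torino", profile: "Scrivo il codice e il basso suonare la chitarra." },
  ],
};
const terms = termsByLanguage(person, "location", "profile");
```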

Comment by Eliot Horowitz [ 10/Jan/13 ]

Mitch, definitely not for 2.4, but certainly likely in the future.
Can you open a new ticket for that?

Comment by Mitch Pirtle [ 10/Jan/13 ]

Will do, thanks for the encouraging words.

Comment by Marian Steinbach [ 25/Jan/13 ]

Great to see full text search come to life!

I have questions about the German stop word list (https://github.com/mongodb/mongo/blob/master/src/mongo/db/fts/stop_words_german.txt). What basis has it been generated from? Where would I add suggestions for improvement?

As it is currently, it contains nouns that definitely shouldn't be in a default stop word list, e.g. "nutzung" (usage), "schreiben" (letter / writing), "arbeiten" (works), "mann" (man), "ehe" (marriage), "frau" (woman), "bedarf" (need), "ende" (end), "fall" (case) etc.

Should I open a new ticket or directly send a pull request?

Comment by Tuner [ 25/Jan/13 ]

No Polish support ;( Can I help somehow with that?

Comment by Dan Pasette [ 25/Jan/13 ]

Marian Steinbach, can you open a new ticket for the German stop word list? It would be helpful to get your feedback.

Tuner, can you also open a new ticket for Polish support?

Comment by Tuner [ 25/Jan/13 ]

Dan Pasette, done

Comment by Marian Steinbach [ 25/Jan/13 ]

Issue on German stop word list created as SERVER-8334

Comment by Steve Schlotter [ 25/Jan/13 ]

Wow wow wow! Thank you for this feature! Is it on the road map to return a cursor instead of a document?

Comment by Eliot Horowitz [ 25/Jan/13 ]

steve - yes, likely to be in 2.6 (returning a cursor and/or including in normal query language).

Comment by Alain Cordier [ 06/Feb/13 ]

I think the French stop word list is also not as accurate as it could be (it contains some nouns, and a lot of words are missing).
You can find a good one (at least for french, but I suppose some others...) here

Comment by Bojan Kostadinovic [ 29/Apr/13 ]

Eliot, is there an option to do full text search but sort by a field instead of by score?

Comment by J Rassi [ 29/Apr/13 ]

Bojan Kostadinovic: yes, see SERVER-9392.

Comment by Bojan Kostadinovic [ 29/Apr/13 ]

Ok in other words "No, but hopefully coming with version 2.5" since that one is Unresolved

Comment by J Rassi [ 29/Apr/13 ]

Correct, that's the in-development syntax Eliot was referring to in his comment "yes, likely to be in 2.6".
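
Until that lands, one workaround is to re-sort the "results" array returned by the text command on the client. The sketch below does this on hand-written sample data shaped like the command output shown in the description; the function name and the sample documents are assumptions, and it can only reorder the documents the command actually returned.

```javascript
// Re-sort the "results" array of a text command response by a document
// field instead of by score. The response below is hand-written sample data.
function sortResultsByField(response, field) {
  return response.results
    .slice() // copy so the original score order is left untouched
    .sort((a, b) =>
      a.obj[field] < b.obj[field] ? -1 : a.obj[field] > b.obj[field] ? 1 : 0
    );
}

const response = {
  results: [
    { score: 1.5, obj: { _id: 2, title: "walking cats", year: 2012 } },
    { score: 0.75, obj: { _id: 5, title: "a cat walks", year: 2009 } },
    { score: 0.6, obj: { _id: 9, title: "walk the dog", year: 2011 } },
  ],
};
const idsByYear = sortResultsByField(response, "year").map((r) => r.obj._id);
```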
