MongoDB 3.2.0 RC4 appears to have a substantial performance regression in full-text search
3000 books obtained from Project Gutenberg (http://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html) were stored in MongoDB as follows:
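A minimal sketch of how each book might be stored, one document per book. The collection name `books` and the field names are assumptions for illustration; the actual loading script is not shown here:

```javascript
// Run in the mongo shell. One document per Gutenberg book;
// field names are illustrative, not confirmed by the report.
db.books.insert({
    title: "The Hound of the Baskervilles",
    author: "Arthur Conan Doyle",
    text: "..."  // full body text of the book
});
```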
This data was then indexed using an "all fields" index:
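In MongoDB, an "all fields" text index is created with the `$**` wildcard specifier. A sketch, assuming the collection is named `books`:

```javascript
// Index every string field in every document for full-text search
db.books.createIndex({ "$**": "text" });
```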
This produces a test dataset of around 1.1GB, with a text index of 155MB (measured with WiredTiger).
This data was loaded into different versions of MongoDB, and various simple searches were run using words and phrases with different occurrence frequencies in the dataset. Each search used the following simple query shape in an aggregation pipeline, with the ultimate goal of reporting the number of books per author containing the search word:
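A sketch of the query shape described above, a `$text` match followed by a per-author count. The collection name `books` and the `author` field are assumptions about the schema; the search term is one of the test words:

```javascript
// Run in the mongo shell against the test database.
db.books.aggregate([
    // Full-text match against the "all fields" text index
    { $match: { $text: { $search: "gigantic hound" } } },
    // Count matching books per author
    { $group: { _id: "$author", count: { $sum: 1 } } },
    // Most prolific matching authors first
    { $sort: { count: -1 } }
]);
```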
The words used are as follows:
- "gigantic hound"
A simple test script ("testQuery_all.js") is attached to automate this process.
All of these results were taken on the third run (i.e. to ensure that the data was as warm as possible). In the case of the 3.2 results, mongod ran one core flat out for the entire query duration.
||Version||Engine||Total Query Duration (ms)||
|3.2.0 RC4|WT Snappy|639862|
Full results are available here:
Source data is here:
Note: text index needs to be manually applied to this data: