Details
Improvement
Resolution: Cannot Reproduce
Major - P3
Description
Hey,
First of all, let me start by stating that up until this point, I used to be a bit of a mongodb fanboy so I wholeheartedly tried to test this in a way that favors mongo. I also want it on the record that I completely understand that the current text search feature is marked as experimental and that it should only be used for testing purposes (which is the case for me so far). However, all evidence points to the fact that you plan on making this feature non-experimental in 2.6 with the $text query syntax and I believe that this is a big mistake given the current state of affairs.
I have a project under development that requires full-text search and would otherwise map extremely well (schema-less collections: check, map/reduce: check) onto the feature set that mongodb currently provides. Because I would ideally like a single-point-of-truth design, I wanted to first test the performance of the full-text search and indexing capabilities. To this end, I downloaded the English Wikipedia data source, wrote an XML SAX parser for it and started importing non-redirecting pages, 1000 at a time (schema: {"title": "string", "text": "wikitext", "category": "integer"}), into a collection that was configured with ensureIndex({"title": "text", "text": "text", "category": 1}) and no other indices defined. I also timed the mongo API call and calculated the average number of rows (documents) inserted per second. Here are the results for the first nine 1000-row batches:
[root@dev1 var]# python import.py
Total: 1000. Speed: 56.00r/s
Total: 2000. Speed: 32.50r/s
Total: 3000. Speed: 13.74r/s
Total: 4000. Speed: 7.76r/s
Total: 5000. Speed: 4.69r/s
Total: 6000. Speed: 4.52r/s
Total: 7000. Speed: 3.37r/s
Total: 8000. Speed: 2.51r/s
Total: 9000. Speed: 2.73r/s
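For anyone who wants to reproduce figures like the ones above without the original script, here is a minimal sketch of a per-batch timing loop. It is not the actual import.py: the names `batches`, `timed_import` and `insert_batch` are hypothetical, and `insert_batch` stands in for whatever bulk-insert call the driver provides. Only the call itself is timed, as described above.

```python
import time
from itertools import islice


def batches(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def timed_import(docs, insert_batch, batch_size=1000, clock=time.perf_counter):
    """Insert `docs` in batches and report rows per second for each batch.

    `insert_batch` is any callable that takes a list of documents
    (assumption: e.g. a bound bulk-insert method of a database driver).
    Returns a list of (running_total, rows_per_second) tuples.
    """
    total = 0
    report = []
    for chunk in batches(docs, batch_size):
        start = clock()
        insert_batch(chunk)  # only the insert call is timed
        elapsed = clock() - start
        total += len(chunk)
        speed = len(chunk) / elapsed if elapsed > 0 else float("inf")
        report.append((total, speed))
        print("Total: %d. Speed: %.2fr/s" % (total, speed))
    return report
```

Passing the insert call in as a parameter keeps the loop identical for both systems under test, so any difference in the numbers comes from the backend and not from the harness.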
At this point, I killed the script because I ran out of patience. My first impression was that something else must be going on. I had used the unstable build from the 10gen RPM repository, so I immediately switched to the stable version. In addition, I moved the virtual disk image file (I did all of this in a virtual machine) from a RAID 1 setup to a "dumb" disk, preallocated it (it was previously a "smart" virtual drive that only expanded as needed by the guest OS) and defragmented the host drive. While this did give me roughly a 50% boost, I still had the same logarithmic drop in throughput as the dataset size grew.
So I installed elasticsearch with a default config and set up the index to do snowball analysis, because based on your supported language list, I figured that's what you're using. Quite honestly, I was expecting it to outperform mongo, because elasticsearch and lucene were designed for this exact purpose. However, the results are not even close:
Total: 1000. Speed: 405.83r/s
Total: 2000. Speed: 456.65r/s
Total: 3000. Speed: 411.27r/s
Total: 4000. Speed: 462.72r/s
Total: 5000. Speed: 407.93r/s
Total: 6000. Speed: 348.01r/s
Total: 7000. Speed: 286.52r/s
Total: 8000. Speed: 385.70r/s
Total: 9000. Speed: 419.75r/s
Total: 10000. Speed: 307.36r/s
Total: 11000. Speed: 266.52r/s
Total: 12000. Speed: 315.56r/s
Total: 13000. Speed: 466.00r/s
Total: 14000. Speed: 409.62r/s
Total: 15000. Speed: 359.44r/s
Total: 16000. Speed: 352.84r/s
Total: 17000. Speed: 403.31r/s
Total: 18000. Speed: 834.88r/s
Total: 19000. Speed: 1481.40r/s
Total: 20000. Speed: 418.26r/s
Total: 21000. Speed: 427.12r/s
Total: 22000. Speed: 1145.24r/s
Total: 23000. Speed: 370.15r/s
Total: 24000. Speed: 593.52r/s
Total: 25000. Speed: 497.98r/s
Total: 26000. Speed: 400.37r/s
Total: 27000. Speed: 474.05r/s
Total: 28000. Speed: 541.41r/s
Total: 29000. Speed: 611.02r/s
Total: 30000. Speed: 443.91r/s
Total: 31000. Speed: 467.19r/s
Total: 32000. Speed: 558.63r/s
Total: 33000. Speed: 549.81r/s
Total: 34000. Speed: 480.87r/s
Total: 35000. Speed: 612.75r/s
Total: 36000. Speed: 580.19r/s
Total: 37000. Speed: 482.62r/s
Total: 38000. Speed: 697.63r/s
Total: 39000. Speed: 681.52r/s
Total: 40000. Speed: 636.69r/s
...
Total: 1140000. Speed: 1288.25r/s
Total: 1141000. Speed: 1454.45r/s
Total: 1142000. Speed: 2061.57r/s
Total: 1143000. Speed: 1145.54r/s
Total: 1144000. Speed: 1775.82r/s
Total: 1145000. Speed: 1629.78r/s
Total: 1146000. Speed: 2178.34r/s
Total: 1147000. Speed: 2205.68r/s
Total: 1148000. Speed: 1658.26r/s
Total: 1149000. Speed: 2199.42r/s
Total: 1150000. Speed: 1018.77r/s
Total: 1151000. Speed: 1877.80r/s
Total: 1152000. Speed: 2337.72r/s
Total: 1153000. Speed: 1719.36r/s
Total: 1154000. Speed: 1906.16r/s
Total: 1155000. Speed: 970.74r/s
Total: 1156000. Speed: 1828.84r/s
Total: 1157000. Speed: 1958.14r/s
Total: 1158000. Speed: 1809.90r/s
Total: 1159000. Speed: 2240.17r/s
Total: 1160000. Speed: 1851.73r/s
Total: 1161000. Speed: 1731.13r/s
Total: 1162000. Speed: 1454.58r/s
Total: 1163000. Speed: 1933.39r/s
Total: 1164000. Speed: 2136.85r/s
Total: 1165000. Speed: 2001.85r/s
Total: 1166000. Speed: 1818.62r/s
Total: 1167000. Speed: 1341.41r/s
Total: 1168000. Speed: 2339.45r/s
Total: 1169000. Speed: 1619.38r/s
Total: 1170000. Speed: 2391.82r/s
Total: 1171000. Speed: 2239.17r/s
Total: 1172000. Speed: 2244.49r/s
Total: 1173000. Speed: 1848.49r/s
Total: 1174000. Speed: 2324.02r/s
Total: 1175000. Speed: 2042.02r/s
Total: 1176000. Speed: 1800.07r/s
Total: 1177000. Speed: 2091.75r/s
Total: 1178000. Speed: 1648.40r/s
Total: 1179000. Speed: 1245.07r/s
Total: 1180000. Speed: 1931.28r/s
Total: 1181000. Speed: 1752.95r/s
Total: 1182000. Speed: 1945.75r/s
Total: 1183000. Speed: 1856.97r/s
Total: 1184000. Speed: 1667.88r/s
Total: 1185000. Speed: 1361.52r/s
Total: 1186000. Speed: 2033.49r/s
Total: 1187000. Speed: 1682.91r/s
Total: 1188000. Speed: 1210.39r/s
Total: 1189000. Speed: 1491.82r/s
...
Total: 1436000. Speed: 737.77r/s
Total: 1437000. Speed: 1304.33r/s
Total: 1438000. Speed: 894.11r/s
Total: 1439000. Speed: 1344.08r/s
Total: 1440000. Speed: 480.54r/s
Total: 1441000. Speed: 1228.01r/s
Total: 1442000. Speed: 929.55r/s
Total: 1443000. Speed: 1165.60r/s
Total: 1444000. Speed: 1004.54r/s
Total: 1445000. Speed: 770.31r/s
Total: 1446000. Speed: 862.52r/s
Total: 1447000. Speed: 658.08r/s
Total: 1448000. Speed: 1465.76r/s
Total: 1449000. Speed: 1066.47r/s
Total: 1450000. Speed: 1347.94r/s
Total: 1451000. Speed: 956.64r/s
Total: 1452000. Speed: 1082.98r/s
Total: 1453000. Speed: 1002.42r/s
Total: 1454000. Speed: 1395.81r/s
Total: 1455000. Speed: 995.55r/s
Total: 1456000. Speed: 1153.25r/s
Total: 1457000. Speed: 1639.04r/s
Total: 1458000. Speed: 1049.73r/s
Total: 1459000. Speed: 563.95r/s
Total: 1460000. Speed: 1486.23r/s
Total: 1461000. Speed: 1098.64r/s
Total: 1462000. Speed: 1255.93r/s
Total: 1463000. Speed: 1357.39r/s
Total: 1464000. Speed: 929.26r/s
Total: 1465000. Speed: 861.80r/s
As you can see, while there are fluctuations in the throughput (which can be blamed on the dataset: some article batches are obviously larger than others), the trend is not a logarithmic drop.
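To put a number on the trend, the two logs can be parsed and the last batch compared with the first. The following is only a sketch: the helper names `parse_log` and `throughput_ratio` are made up for illustration, and comparing single batches is a crude measure given the fluctuations noted above.

```python
import re

# Matches lines of the form "Total: 1000. Speed: 56.00r/s".
LINE_RE = re.compile(r"Total: (\d+)\. Speed: ([\d.]+)r/s")


def parse_log(text):
    """Return a list of (total_rows, rows_per_second) tuples found in `text`."""
    return [(int(m.group(1)), float(m.group(2))) for m in LINE_RE.finditer(text)]


def throughput_ratio(samples):
    """Speed of the last batch divided by the speed of the first batch."""
    return samples[-1][1] / samples[0][1]


# First and last lines of each log, copied from the output in this report.
mongo_log = "Total: 1000. Speed: 56.00r/s\nTotal: 9000. Speed: 2.73r/s\n"
es_log = "Total: 1000. Speed: 405.83r/s\nTotal: 1465000. Speed: 861.80r/s\n"

for name, log in (("mongo", mongo_log), ("elasticsearch", es_log)):
    print("%s: last/first = %.3f" % (name, throughput_ratio(parse_log(log))))
```

A ratio well below 1 means throughput fell as the collection grew. A ratio around or above 1 means it held up.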
As such, it is my opinion that you shouldn't release the full-text feature as non-experimental until you solve the throughput issue. Anything else would simply be lying to your customers, who expect huMONGOus dataset support. (Hell, 10k documents could even be considered small by current standards.)
If requested, I can provide the VM, dataset and import script I used for testing.
System specs:
Intel quad core i7 2600K @ 3.4GHz (two cores assigned to the VM)
16GB DDR3-SDRAM @ 1.3GHz (2GB assigned to the VM)
1TB WDC @ 7200RPM w/ 64MB buffer (128GB assigned to the VM; host OS runs on a different drive)
Virtualization used: Oracle VirtualBox 4.3.6 running CentOS 6.5