[SERVER-15131] Insert performance to a fulltext-enabled collection Created: 03/Sep/14  Updated: 07/Oct/14  Resolved: 25/Sep/14

Status: Closed
Project: Core Server
Component/s: Performance
Affects Version/s: 2.6.2
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Jaan Paljasma Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 0
Labels: fulltext, text_index
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OSX 10.9.4, MongoDB 2.6.0


Participants:

 Description   

I am experiencing slowdown when inserting documents into a collection with more than 650,000 documents in it. The data used is wikipedia content dump. It started fast as always, and I was getting decent ~1000 documents/second, however the performance degraded after every batch of 3000 documents, and now at the 666000 documents the speed is less than 10 documents per second. My goal was to insert 1M documents to test full text speed, which seems now forever at this pace.

Is this by design?

Here are the indexes:

db.content.getIndexes()
[
	{
		"v" : 1,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_",
		"ns" : "mediawiki.content"
	},
	{
		"v" : 1,
		"key" : {
			"_fts" : "text",
			"_ftsx" : 1
		},
		"name" : "fulltext_index",
		"ns" : "mediawiki.content",
		"weights" : {
			"body" : 1,
			"title" : 1
		},
		"default_language" : "english",
		"language_override" : "language",
		"textIndexVersion" : 2
	}
]

and stats

db.content.stats()
{
	"ns" : "mediawiki.content",
	"count" : 663429,
	"size" : 3283468976,
	"avgObjSize" : 4949,
	"storageSize" : 3918798848,
	"numExtents" : 22,
	"nindexes" : 2,
	"lastExtentSize" : 1021497344,
	"paddingFactor" : 1,
	"systemFlags" : 1,
	"userFlags" : 1,
	"totalIndexSize" : 6679227856,
	"indexSizes" : {
		"_id_" : 21560112,
		"fulltext_index" : 6657667744
	},
	"ok" : 1
}



 Comments   
Comment by Ramon Fernandez Marina [ 07/Oct/14 ]

After some further investigation my colleage adinoyi.omuya@10gen.com has found the cause of the behavior you describe: the records in your corpus get bigger and bigger, so it is not surprising it takes longer to insert them.

The JSON string representing the first record in your corpus is 221 bytes long:

{"body": "'''Wei-chi''' may refer to:\n*The [[game of go]]\n*The [[Chinese word for \"crisis\"]]\n\n{{dab}}", "timestamp": {"$date": 1409836258000}, "page_id": 747205, "user": "TheQ Editor", "title": "Wei-chi"}

while the last one is 57231 bytes long.

I created a corpus of 1M records, each 221 bytes long as the one above, and the insert speed is constant as shown in the log below:

connected to: 127.0.0.1
2014-10-07T17:45:44.002-0400                    19500   6500/second
2014-10-07T17:45:47.010-0400                    42800   7133/second
2014-10-07T17:45:50.006-0400                    67000   7444/second
2014-10-07T17:45:53.003-0400                    91500   7625/second
2014-10-07T17:45:56.004-0400                    115600  7706/second
2014-10-07T17:45:59.013-0400                    139400  7744/second
2014-10-07T17:46:02.004-0400                    162800  7752/second
2014-10-07T17:46:05.003-0400                    186900  7787/second
2014-10-07T17:46:08.002-0400                    210600  7800/second
2014-10-07T17:46:11.004-0400                    234700  7823/second
2014-10-07T17:46:14.004-0400                    258600  7836/second
2014-10-07T17:46:17.003-0400                    282900  7858/second

When I create a corpus of records like the bigger ones in your corpus I do get a lower insert speed as expected, but it's also constant:

connected to: 127.0.0.1
2014-10-07T18:30:29.231-0400                    300     100/second
2014-10-07T18:30:32.030-0400                    600     100/second
2014-10-07T18:30:35.728-0400                    1000    111/second
2014-10-07T18:30:38.553-0400                    1300    108/second
2014-10-07T18:30:41.388-0400                    1500    100/second
2014-10-07T18:30:44.176-0400                    1800    100/second
2014-10-07T18:30:47.011-0400                    2100    100/second

In summary, the change in insert speed you're observing is inversely proportional to the size of the records you're inserting.

Regards,
Ramón.

Comment by Jaan Paljasma [ 30/Sep/14 ]

Hi Ramon,

I assume that "works as designed" may sound comfortable, but this particular scenario is an excellent test case revealing room for improvement in MongoDB full text. Suggestion to drop index, insert document(s) and then recreate index on production environment is not an option to anyone: tested and it took about 11 minutes to regenerate full text index for only 500,000 documents.

Questions:
Shard key cannot contain index with full text - then how full text search works with sharded environment? I was having hard time looking it up. It it will work, it will be scatter/gather anyways I assume.
Is MongoDB considered a viable solution when you have unpredicted number of inserts per second, and when your collection size exceeds certain size?
Is it possible to consider using multiple CPU cores (when available) when indexing?

Comment by Ramon Fernandez Marina [ 25/Sep/14 ]

jaan@hebsdigital.com, after running some tests on our end it seems that updating the index is killing insert performance. With this many documents, you may be better off inserting the data first and then building the index. Every time you need to insert another large number of documents, performance may be better if you drop the index before inserting documents and rebuild it immediately afterwards.

If you want to discuss other options please post on the mongodb-user group or Stack Overflow with the mongodb tag, where you'll reach a larger audience. A question like this involving more discussion would be best posted on the mongodb-user group.

Regards,
Ramón.

Comment by Ramon Fernandez Marina [ 08/Sep/14 ]

Thanks for the additional information and detailed instructions jaan@hebsdigital.com, we'll take a look and let you know what we find.

Regards,
Ramón.

Comment by Jaan Paljasma [ 08/Sep/14 ]

Update: I have exported data from MySQL to JSON dump file and using mongoimport to import the data to a collection with full text index already enabled.

Furthermore, I posted new mediawiki.json.tar.gz to you. You can extract it, and replay the scenario by following:

1) first create full text index on collection "content":

db.content.ensureIndex({title:"text",body:"text"}, {name:'fulltext_index'})

2) run mongoimport:

$ mongoimport -d mediawiki -c content < mediawiki.json

3) take time.

2014-09-08T14:12:57.922-0400 			512800	55/second

Comment by Jaan Paljasma [ 04/Sep/14 ]

Here you go,

$ vm_stat 1
Mach Virtual Memory Statistics: (page size of 4096 bytes)
    free   active   specul inactive throttle    wired  prgable   faults     copy    0fill reactive   purged file-backed anonymous cmprssed cmprssor  dcomprs   comprs  pageins  pageout  swapins swapouts
    2077   772147     9757   762732        0   478051       24 3014321K  563379K  924820K  146853K  9158629       90084   1454552  4234273  2168762 54315905 66312019 27886688  1118946 12401053 13244257 
    2003   772230     9650   762850        0   477978       22     1424       61       68      146        2       90106   1454624  4233609  2168392      737       73      118        0        0        0 
    2336   772759    10542   762175        0   477882       22     3401       47     1050      264       37       91366   1454110  4232785  2167819     1173      349      101        0      126        0 
    2006   773058    10248   762985        0   477516       22     1154       71      203        6        0       91173   1455118  4232190  2167563      595        0      132        2        0        0 
    1552   773447    10948   762296        0   477130       22     1472       60      101      331        0       91978   1454713  4231650  2167220      813      273      126        5        0        0 
    2269   773616    10943   762599        0   476684       22      735       56        6        2        0       92041   1455117  4231210  2166931      440        0      106        0        0        0 
    1852   774237    11086   762897        0   476618       22     1724       82       11        0        0       92298   1455922  4230289  2166576      921        0      136        0        0        0 
    1933   774193    10608   763631        0   476721       22     2832       50     1150        0       37       92144   1456288  4229501  2165988      788        0       91        0        0        0 
    1803   774528    10497   764327        0   476300       22      869       64       19        0        0       92137   1457215  4228986  2165775      515        0      119        0        0        0 
    2002   774521    10531   764134        0   476280       22     1287       65       28     1104        0       92183   1457003  4228737  2165589      485      236      115        2        0        0 
    1871   775168    10377   764123        0   476191       22     1421       46      129        4        0       92150   1457518  4228042  2165583      752       57       70        0       60        0 
    2091   773817     9115   765095        0   477567       22     2839       58       81      531        0       91154   1456873  4228559  2165692      441      958      154        0       60        0 
    2246   774379     8702   765010        0   476285       22     2896       54     1093     1560       37       91035   1457056  4229587  2166459      629     1657      121        0      192        0 
    2248   775144     9940   764890        0   475822       22     1563       54      100      673        0       92063   1457911  4229155  2166009      983      551      108        0        0        0 
    2256   774944     9476   765325        0   475758       22     1047       82      198      406        0       91716   1458029  4229135  2166014      482      462      134        0       63        0 
    2360   774779     8771   766190        0   475775       22     1181       80       21      128        0       91133   1458607  4228650  2165574      705      220      151        3        0        0 
    2056   774959     8887   766065        0   476121       28     5595       63     3334      241        0       91339   1458572  4228578  2165579      378      316      121        0        0        0 
    2216   774820     8228   766537        0   476136       69     3234       20     2306        0        6       90707   1458878  4228418  2165542      160        0       38        0        0        0 
    1665   775475     8055   766922        0   475798       22     3113       73     2171        0       47       90628   1459824  4227874  2165336      544        0      113        0       62        0 
    2000   774679     6864   768125        0   475822       22     1732       33      701        0        0       89489   1460179  4227567  2165033      307        0       72        0        0        0 
    2326   774773     7652   767485        0   475804       22     1176       82       12      826        0       90454   1459456  4228324  2165374      716     1473      140        2       58        0 
Mach Virtual Memory Statistics: (page size of 4096 bytes)
    free   active   specul inactive throttle    wired  prgable   faults     copy    0fill reactive   purged file-backed anonymous cmprssed cmprssor  dcomprs   comprs  pageins  pageout  swapins swapouts
    2166   774712     7028   767846        0   476121       22 3014363K  563380K  924833K  146859K  9158795       89945   1459641  4227840  2165121 54328956 66318647 27889089  1118960 12401674 13244257 
    2194   774221     7790   766875        0   476149       22     4438       70     3064     1440       37       90802   1458084  4229360  2165822      472     1992      156        2        0        0 
    2033   774755     8371   766114        0   475789       22      850       64       10     1108        0       91481   1457759  4230077  2166114      527     1244      116        8       63        0 
    1835   774934     8755   766421        0   475795       22      602       61       22        0        0       91964   1458146  4229750  2166062      327        0      121        0        0        0 
    2382   774614     8604   765903        0   475820       22     1362       77      320     1011        0       91941   1457180  4230906  2166486      635     1791      141       14        0        0 
    2784   773666     8722   765279        0   475846       22      498       19       63     1408        0       92091   1455576  4232556  2167340      267     1917       48        0       61        0 
    1689   772763     9270   764129        0   476200       22     4682      101     1340     2257       37       93098   1453064  4234951  2169206     1265     3660      206        0      934        0 
    1823   773339     9040   764443        0   475744       22      729       47      121        0        0       93064   1453758  4234632  2169518      319        0      134        0      383        0 
    2025   771708     9415   763388        0   476282       22     3338       55     1477     2195        0       93598   1450913  4237342  2170769      617     3327      196        0       62        0 
    1998   771937     8769   763361        0   475895       24     4572       50     2179      411        0       93083   1450984  4237324  2170777      740      722      106        0       62        0 
    2389   771511     9273   763097        0   476084       22     1636       48      973      402        2       93688   1450193  4237824  2170898      271      771       90        0        0        0 
    2200   772477     9144   763172        0   475670       22     4025       84     2558        0       37       93679   1451114  4237304  2170645      520        0      144        0        0        0 
    2109   772631     9192   763694        0   475200       22     1350       75      356      267        0       93855   1451662  4237197  2170581      518      411      154        0        0        0 
    2077   771883     9141   763046        0   475948       22     1333       76      119      482        0       93926   1450144  4237879  2170957      743     1425      133        0      118        0 
    1664   772609     9389   762981        0   475497       22      470       30       13        0        0       94204   1450775  4237616  2170882      263        0       70        0       61        0 
    2363   772518     8753   763865        0   475178       22     1448      113       10      553        0       93759   1451377  4237434  2170821      847      665      218        2      108        0 
    1808   772607     8286   764141        0   475535       22     2560       77     1302        0       37       93404   1451630  4237059  2170658      375        0      136        0       61        0 
    2000   772936     8098   764418        0   475181       22      997       70       11      413        0       93323   1452129  4236986  2170624      594      521      118        0        0        0 
    1945   772337     8050   764419        0   476729       22     1104       14       10        0        0       93304   1451502  4236793  2170475      193        0       42        0        0        0 
   21536   762749     8376   754671        0   475111       22     4111        0      102      390        0       93984   1431812  4235651  2170446     1557      532       69        2       62        0 
   19895   763332     9810   754715        0   475122       22      901        0       19        0        0       95462   1432395  4235052  2170432      599        0       22        0       62        0 
Mach Virtual Memory Statistics: (page size of 4096 bytes)
    free   active   specul inactive throttle    wired  prgable   faults     copy    0fill reactive   purged file-backed anonymous cmprssed cmprssor  dcomprs   comprs  pageins  pageout  swapins swapouts
   18228   763537    11041   754701        0   475482       59 3014407K  563381K  924848K  146871K  9158945       96707   1432572  4234557  2170524 54341100 66337625 27891552  1118988 12403835 13244257 
   16544   763443    12283   754695        0   476040       59      653        0       33        0        0       97949   1432472  4234106  2170513      451        0       24        0       61        0 
   13898   764561    13615   754699        0   475646       59      970        0      168        0        0       99290   1433585  4233516  2170638      590        0       23        0      125        0 
   10678   765130    14729   754696        0   475793       59      987        0       85        0        0      100439   1434116  4233030  2171599      486        0       52        0      996        0 
    8277   766165    16469   754952        0   475819       59     1638        0       86        0        0      102468   1435118  4231991  2171921     1039        0       61        0      508        0 
    5615   766659    18314   754936        0   476178       96     2714        0     1064       16        0      104313   1435596  4231174  2171825      817        0       26        0       60        0 
    3944   767703    19690   754936        0   475829       96      888        0      196        0        0      105689   1436640  4230647  2171740      527        0       24        0        0        0 
    2289   768092    20472   754926        0   475873       22      713        0       91        0       74      106472   1437018  4230176  2171676      471        0       13        0        0        0 
    4151   768681    20776   755186        0   475730       22     1345        0       13        0        0      107037   1437606  4229545  2168453      631        0       12        0        0        0 
    1577   769604    22084   755242        0   475734       22     1727        0       27        0        0      108405   1438525  4228565  2168474      980        0       19        0       64        0 
    2256   769411    21959   755394        0   476102       22     1974        0     1140        0       37      108434   1438330  4228362  2168474      203        0        7        0        0        0 
    2000   774582    15540   759205        0   475765       22     9515        0       55        0        0      102308   1447019  4220046  2165381     8316        0      105        0        0        0 
    2463   774882    16020   758928        0   475768       22      757        0      116        0        0      102311   1447519  4219572  2165252      474        0       43        0        0        0 

Comment by Ramon Fernandez Marina [ 04/Sep/14 ]

Thanks jaan@hebsdigital.com. The mongostat info you sent shows hundreds of page faults per second. Could you run vm_stat 1 on a terminal and post the output after a few iterations? Something like this (but with some more iterations):

$ vm_stat 1
Mach Virtual Memory Statistics: (page size of 4096 bytes)
    free   active   specul inactive throttle    wired  prgable   faults     copy    0fill reactive   purged file-backed anonymous cmprssed cmprssor  dcomprs   comprs  pageins  pageout  swapins swapouts
  996666  1032883    70062  1474723        0   446965   578522 1298150K 25370683 1461411K 35716472  9176784      760483   1817185   724792   172516  192548K  201973K 15833912    34306 10876102 11529463 
  995270  1042152    70062  1474373        0   439346   580122    12464     6208     3645        0        0      760473   1826114   724792   172516        0        0        0        0        0        0 
  995488  1042060    70062  1474373        0   439254   580122      477        5      465        0        0      760473   1826022   724792   172516        1        0        0        0        0        0 

Given the large size of the index I'm wondering whether your system is busy handling page faults.

Comment by Jaan Paljasma [ 04/Sep/14 ]

Ramon, this is a standalone machine 16GB RAM, OSX 10.9.4, db.version() reports 2.6.0, no replication. I can give you a dataset I use once current application finishes; it will be MySQL dump (tar.gz) containing 1M rows.

Comment by Ramon Fernandez Marina [ 04/Sep/14 ]

jaan@hebsdigital.com, is this on a sharded setup or in a stand-alone machine? Are you using replication? How much memory is installed in the machine(s) where mongod is running? You mention 2.6.0 and 2.6.2, can you clarify which version you're using?

Also it could be useful to get the exact same dataset you're using so we can try to reproduce this behavior on our end. If you would consider sharing it please let us know and we can send upload instructions.

Thanks,
Ramón.

Comment by Jaan Paljasma [ 04/Sep/14 ]

Here's the sample of mongostat revealing that database is locked for most of the time during inserts.
https://www.dropbox.com/s/v8pwibq4nh8xmtc/Screenshot%202014-09-04%2009.58.30.png?dl=0

At the moment the process is still at 925,000 documents, insert performance is now ~ 2 documents / sec.

Generated at Thu Feb 08 03:37:02 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.