[SERVER-16544] mapReduce very slow with sort on emit key Created: 14/Dec/14  Updated: 04/Apr/15  Resolved: 03/Apr/15

Status: Closed
Project: Core Server
Component/s: MapReduce, Performance
Affects Version/s: 2.6.5
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Nick Milonakis Assignee: Ramon Fernandez Marina
Resolution: Cannot Reproduce Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian, MongoDB version: 2.6.5


Operating System: Linux
Steps To Reproduce:

1) in the mongo shell, paste the following to define a JS function that returns a test document with n fields

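function TestDoc (n) {
	var doc = {};
	var langs = ['da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'nl', 'pt', 'ro', 'ru', 'sv', 'tr'];
	// Math.random() * 13 yields indexes 0..12, so the 14th entry ('tr') is never picked;
	// this is consistent with the 13 output keys reported in the results below.
	doc['lang'] = langs[Math.floor(Math.random() * 13)];
	// Fill n fields with random base-34 strings.
	for (var i = 0; i < n; i++) {
		doc['fld_' + i] = Math.random().toString(34).slice(2);
	}
	return doc;
}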

2) define a function that inserts nDocs documents, each with nFields fields, into a collection

function InsertTestDocs (colName, nDocs, nFields) {
	for (var i = 0; i < nDocs; i++) {
		var doc = TestDoc(nFields);
		doc['_id'] = i;
		db[colName].insert(doc);
	}
}

3) insert 1000000 test documents

InsertTestDocs("tmp_col", 1000000, 100)

4) create index on 'lang' field

db.tmp_col.ensureIndex({lang: 1})

5) run a mapReduce job that simply counts the documents for each distinct value of the lang field

db.runCommand({ mapreduce: "tmp_col", map: function () { emit(this.lang, 1); }, reduce: function (key, values) { return Array.sum(values); }, out: { inline: 1 }})

6) you get results of the following form

"timeMillis" : 116705,
	"counts" : {
		"input" : 1000000,
		"emit" : 1000000,
		"reduce" : 65000,
		"output" : 13
	},
	"ok" : 1

7) run the same mapReduce, this time specifying a sort on the emit key

db.runCommand({ mapreduce: "tmp_col", map: function () { emit(this.lang, 1); }, reduce: function (key, values) { return Array.sum(values); }, sort:{lang:1}, out: { inline: 1 }})

8) you get the following results

"timeMillis" : 1478708,
	"counts" : {
		"input" : 1000000,
		"emit" : 1000000,
		"reduce" : 8474,
		"output" : 13
	},
	"ok" : 1

Notice that it now takes 1478708 ms instead of the 116705 ms taken without the sort option (that is, more than 10X slower, about 12.7X)
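
The two result documents above can be compared directly. The snippet below is a small sketch that is not part of the original report: `compareRuns`, `withoutSort` and `withSort` are hypothetical names, and the numbers are copied from the results shown in the steps. It uses only plain JavaScript, so it should run in the mongo shell as well as elsewhere.

```javascript
// Figures copied from the two result documents in the steps above.
var withoutSort = { timeMillis: 116705, reduce: 65000 };
var withSort = { timeMillis: 1478708, reduce: 8474 };

// Hypothetical helper: how much slower is `candidate` than `baseline`,
// and how many times fewer reduce calls did it make?
function compareRuns(baseline, candidate) {
  return {
    slowdown: candidate.timeMillis / baseline.timeMillis,
    reduceRatio: baseline.reduce / candidate.reduce
  };
}

var comparison = compareRuns(withoutSort, withSort);
```

Here `comparison.slowdown` comes out at roughly 12.7 and `comparison.reduceRatio` at roughly 7.7: the sorted run made far fewer reduce calls yet took much longer, which suggests the extra time was not spent in the reduce function itself.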

Participants:

 Description   

Map Reduce operations become very slow (more than one order of magnitude slower) when run with the sort option on the emit field.
This is contrary to the documentation, which states quite the opposite.
Kindly note:
1. The delay seems to be somewhat proportional to the number of fields per document and/or document complexity.
2. It does not seem to matter whether the map reduce outputs inline or to a collection.
3. I get about the same results on a standalone MongoDB and on a replicated one.
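
To illustrate why sorting the input on the emit key would be expected to lower the number of reduce calls (as the "reduce" counts in the reproduction steps also show), here is a toy model in plain JavaScript. It is only a sketch under stated assumptions: it is NOT MongoDB's actual mapReduce implementation, and `countReduceCalls`, the batch size of 100 and the round-robin test data are all made up for illustration.

```javascript
// Toy model only: NOT MongoDB's actual mapReduce implementation.
// Assumption: emitted values are buffered per key, and every `batchSize`
// emits the buffer is flushed; each key holding more than one value at
// flush time costs one reduce call.
function countReduceCalls(keys, batchSize) {
  var totals = {};      // running reduced value per key
  var buffer = {};      // values emitted since the last flush
  var buffered = 0;
  var reduceCalls = 0;

  function reduce(key, values) {
    reduceCalls++;
    return values.reduce(function (a, b) { return a + b; }, 0);
  }

  function flush() {
    Object.keys(buffer).forEach(function (key) {
      var values = buffer[key];
      // Re-reduce together with the previous partial result, if any.
      if (Object.prototype.hasOwnProperty.call(totals, key)) {
        values.unshift(totals[key]);
      }
      totals[key] = values.length > 1 ? reduce(key, values) : values[0];
    });
    buffer = {};
    buffered = 0;
  }

  keys.forEach(function (key) {
    (buffer[key] = buffer[key] || []).push(1);   // emit(key, 1)
    if (++buffered === batchSize) flush();
  });
  if (buffered > 0) flush();
  return { totals: totals, reduceCalls: reduceCalls };
}

// Made-up data: 13 keys in round-robin order, then the same keys sorted.
var langs = ['da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'nl', 'pt', 'ro', 'ru', 'sv'];
var unsortedKeys = [];
for (var i = 0; i < 13000; i++) unsortedKeys.push(langs[i % langs.length]);
var sortedKeys = unsortedKeys.slice().sort();

var unsortedRun = countReduceCalls(unsortedKeys, 100);
var sortedRun = countReduceCalls(sortedKeys, 100);
```

In this model both runs produce the same totals, but `sortedRun.reduceCalls` is much smaller than `unsortedRun.reduceCalls`, because each batch of sorted input touches only one key. The model says nothing about the cost of producing the sorted input, which is where the slowdown reported here would have to come from.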



 Comments   
Comment by Ramon Fernandez Marina [ 04/Apr/15 ]

Thanks for getting back to me nickmilon. Glad to hear you've upgraded to 3.0.1 – lots of exciting new things in 3.0!

I'm not sure it makes a lot of sense to ask you to spend time chasing this on 2.6.9 if we'd then have to do the same thing here. Since you gave us a detailed reproducer, I'll be happy to test this on the v2.6 branch again and see what I find. I'll re-open the ticket if I can reproduce.

Cheers,
Ramón.

Comment by Nick Milonakis [ 04/Apr/15 ]

Hi Ramon,
The issue was still there the last time I checked with MongoDB version 2.6.9; unfortunately I haven't kept the logs, since I have upgraded to 3.0.1.
Somehow it disappeared with the new version, so I am OK with it.
In case you want to investigate it on 2.6.9, just let me know and I will try to reproduce it.

Regards
Nick

Comment by Ramon Fernandez Marina [ 03/Apr/15 ]

Hi nickmilon, apologies for the "radio silence". I think I'm going to resolve this ticket, and if the issue appears again you may want to consider enabling the profiler and uploading the logs.

Regards,
Ramón.
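
A minimal sketch of the profiler suggestion above, for anyone who hits this again. `runWithProfiler` is a hypothetical helper that is not from the ticket; it assumes a mongo shell `db` handle and uses the shell's `db.getProfilingLevel()`, `db.setProfilingLevel()` and `db.runCommand()`.

```javascript
// Hypothetical helper (not from the ticket): run one command with the
// database profiler enabled, then restore the previous profiling level.
function runWithProfiler(db, command) {
  var previous = db.getProfilingLevel();
  db.setProfilingLevel(2);            // level 2 = record all operations
  try {
    return db.runCommand(command);
  } finally {
    db.setProfilingLevel(previous);   // restore even if the command throws
  }
}
```

After calling it with the slow mapReduce command, the recorded entries could be inspected with something like `db.system.profile.find().sort({ ts: -1 }).limit(5).pretty()` and attached to the ticket.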

Comment by Nick Milonakis [ 22/Dec/14 ]

Hi Ramon,

Thanks for your prompt reply.
I replied to this before, but today I realized my reply somehow got misplaced, so let me repeat it:

I tried those tests on an 8GB machine with ~4GB free, then freed some memory (~7GB free) and reran them without any noticeable difference.

Kindly let me know if you want me to try something else on my end.

Regards
Nick

Comment by Ramon Fernandez Marina [ 15/Dec/14 ]

Hi nickmilon,

Thanks for the full, detailed reproducer. I'm unable to see this behavior on my end: using the sort option makes the command run 4 times faster. This is on MacOS, but I'll try on Linux next. How much memory do you have on this machine?

Generated at Thu Feb 08 03:41:23 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.