[SERVER-3055] MapReduce Performance very slow compared to Hadoop Created: 06/May/11  Updated: 29/Feb/12  Resolved: 02/Dec/11

Status: Closed
Project: Core Server
Component/s: JavaScript
Affects Version/s: 1.8.0
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Jim Olson Assignee: Antoine Girbal
Resolution: Duplicate Votes: 5
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux


Participants:

 Description   

I have run into a dilemma with MongoDB. We have been performing
some MapReduce benchmarks against Hadoop and have found MongoDB
to be a lot slower than Hadoop (65 minutes vs. 2 minutes) for a CPU-intensive
MapReduce job that breaks up strings and computes word counts
on a large number of email texts (about 974 MB worth). I sharded the collection
across 3 servers and verified with db.printShardingStatus() that it was evenly
distributed; there are 7/8/7 chunks on the 3 shards.
The collection is also indexed.

Basically we have a couple of questions:

  • Is there any alternative to using JavaScript for the map and reduce functions from the Java API? We suspect the JavaScript may be slowing things down considerably.
  • Are there other overhead threads running that can or should be disabled to speed up MapReduce performance?

It just seems that this should execute a lot faster.

Thank you for any help,
Jim Olson

Kyle Banker's response to this was:

"These results aren't surprising. You're right that the JavaScript
engine is slow (and single-threaded). We're upgrading to V8, which may
help somewhat, but it still won't be as fast as, say, Hadoop.

MongoDB 2.0 will have a different, improved aggregation framework that
doesn't use JS. That will greatly improve aggregation for a lot of use
cases. I'd recommend that you create a JIRA issue for this use case so
that we can track interest and make sure that the new framework can
support it."

So this is my JIRA ticket.
Please let me know if I can provide further details.
Thank you. jamesolson@noviidesign.com
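
(For context on Kyle's point above: the aggregation framework he mentions eventually gained a $split string operator in MongoDB 3.4, which makes a JavaScript-free word count expressible. A minimal sketch, assuming the emails live in a collection named emails with the text in a body field, and splitting on spaces only rather than the full regex this job uses:)

db.emails.aggregate([
    // lower-case the body and split it into words on spaces
    { $project: { words: { $split: [ { $toLower: "$body" }, " " ] } } },
    // one output document per word
    { $unwind: "$words" },
    // count occurrences of each distinct word
    { $group: { _id: "$words", count: { $sum: 1 } } }
])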



 Comments   
Comment by Antoine Girbal [ 02/Dec/11 ]

The speed will be improved by the switch to V8 (it should be 2-3x faster), so I am marking this as a duplicate. Please reopen if you have more questions, or post on the mongodb-user group for help troubleshooting MR.

Comment by Antoine Girbal [ 12/Oct/11 ]

Jim,
did you have a chance to do further testing with your map/reduce?
If you are still able to test this with Mongo, there are several improvements in 2.0 that can make it perform faster. These include:

  • less disk I/O on regular jobs
  • a mode where everything stays in JavaScript with no disk I/O (see the sketch below)
  • better V8 support

What is the data size you are computing?
thanks
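
(A minimal sketch of that in-memory mode, assuming the same emails collection and placeholder functions mapFn and reduceFn; jsMode keeps the intermediate results inside the JavaScript engine instead of converting them to BSON and touching disk:)

db.emails.mapReduce(mapFn, reduceFn, {
    out: { inline: 1 },  // return the results inline instead of writing a collection
    jsMode: true         // keep map/reduce intermediates in JS, avoiding disk I/O
})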
Comment by Jim Olson [ 10/May/11 ]

I got the bigger job to complete by adding a line of JavaScript to the map function to emit only if the word length is greater than 1. The job then completed in 20.3 minutes. But why did the exceptions above occur? The original job was counting an additional 37 words (a-z, 0-9, and the empty string), and the empty string had about 13 million occurrences.
It concerns me that the exceptions above killed the MapReduce jobs. Thanks
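
(Presumably the added guard, applied to the per-word map function Eliot suggests below, looked something like this:)

var word = arr[i];
if (word.length > 1) {   // skip single characters and the empty string
    emit(word, 1);
}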

Comment by Jim Olson [ 09/May/11 ]

The same abend happened the 2nd time. The 261277 is not the number of distinct words (that is 167258) – I don't know what the 261277 represents. I can see from the first (successful, shorter) job that the maximum datum was for the empty string, which had a count of 13205099; multiplied by 10 (for the big set) that would yield 132050990, or 0x7DEF02E hex. I was thinking the count might have exceeded the range of an integer, but it shouldn't have.

Comment by Jim Olson [ 09/May/11 ]

The big job abended with an odd error:
com.mongodb.CommandResult$CommandFailure: command failed [command failed [mapreduce] {"cause" : {"assertion" : "Invalid BSONObj size: 18597080 (0xD8C41B01) first element: 0: 261677" , "assertionCode" : 10334 , "errmsg" : "db assertion failure" , "ok" : 0.0} , "errmsg" : "mongo mr failed: { assertion: \"Invalid BSONObj size: 18597080 (0xD8C41B01) first element: 0: 261277.0\", assertionCode: 10334, errmsg: \"db assertion failure\", ok: 0.0 }"}
        at com.mongodb.CommandResult.getException(CommandResult.java:69)
        at com.mongodb.CommandResult.throwOnError(CommandResult.java:79)
        at com.mongodb.DBCollection.mapReduce(DBCollection.java:961)
        at my code, where it invoked the mapReduce job on the collection

What is confusing to me is that the decimal and hex sizes differ. The 261677 is the number of distinct words in the data set. The data is a 974 MB collection of email texts spread over 3 sharded servers.
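
(One observation on the assertion: 18597080 bytes is just over MongoDB's 16 MB BSON document limit as of 1.8, which would fit with the reduce stage building one giant document of all word counts under the single "mr" key. A quick shell check:)

var bsonLimit = 16 * 1024 * 1024;  // 16777216 bytes, the max BSON document size since 1.8
var observed  = 18597080;          // size reported in the assertion
observed > bsonLimit               // true: the combined word-count document overflows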

Comment by Eliot Horowitz (Inactive) [ 09/May/11 ]

It should also be a lot faster in 2.0, so it's worth seeing what happens there.

Comment by Jim Olson [ 09/May/11 ]

Eliot, thanks. I tried this and it works. It's about 50% faster.
It took about 3 minutes 18 seconds on 10% of the original data set,
so extrapolating that out it would be 33.1 minutes, roughly half the
time of the original run. Definitely a marked improvement, but still
slow compared to Hadoop. I will run the full job again just to see
if it differs from this extrapolation and let you know.

Comment by Eliot Horowitz (Inactive) [ 09/May/11 ]

It looks like you basically wrote your own map/reduce engine inside of the map/reduce engine.

Try this

function () {
    var b = this.body.toLowerCase();
    var re = /]|\u005c|\u000d|[- \t,.<>()[{}\/?!|*'"`~+=_&^%;:#@$]/;
    var arr = b.split(re);
    for (var i = 0; i < arr.length; i++) {
        var word = arr[i];
        emit(word, 1);
    }
}

The reduce function is:

function (key, values) {
    return Array.sum(values);
}
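
(To run Eliot's pair from the mongo shell, the invocation would look something like this; emails and word_counts are assumed names for the input and output collections:)

var mapFn = function () { /* Eliot's map function above */ };
var reduceFn = function (key, values) { return Array.sum(values); };
// writes one { _id: word, value: count } document per distinct word
db.emails.mapReduce(mapFn, reduceFn, { out: "word_counts" });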

Comment by Jim Olson [ 09/May/11 ]

Eliot,
The map function is:

function () {
    var wordsHash = {};
    var b = this.body.toLowerCase();
    var re = /]|\u005c|\u000d|[- \t,.<>()[{}\/?!|*'"`~+=_&^%;:#@$]/;
    var arr = b.split(re);
    for (var i = 0; i < arr.length; i++) {
        var word = arr[i];
        if (word.length > 1) {
            if (wordsHash[word]) wordsHash[word] += 1;
            else wordsHash[word] = 1;
        }
    }
    emit("mr", wordsHash);
}

The reduce function is:

function (key, values) {
    var wordHashTotals = {};
    for (var i = 0; i < values.length; i++) {
        var wordHash = values[i];
        for (var word in wordHash) {
            var wordCount = wordHash[word];
            if (wordHashTotals[word]) wordHashTotals[word] += wordCount;
            else wordHashTotals[word] = wordCount;
        }
    }
    return wordHashTotals;
}

The rest is just a simple Java program that creates a MapReduceCommand object on
the data collection and then submits the job.

Hope this helps. I looked it over and I don't think there are any typos.
I do get correct results with it; it just takes a long time.
I had to create 3 groups for the regexp because having the ], backslash,
and newline chars in the big group caused Java syntax errors.

Regards,
Jim Olson

Comment by Eliot Horowitz (Inactive) [ 09/May/11 ]

First, one option is to use Hadoop for processing, with the data input and output in Mongo.
See: https://github.com/mongodb/mongo-hadoop

Second, can you send the code you're using?
There are definitely ways it can be optimized.

Also, the new aggregation framework might make things much faster.
It all depends on exactly what you're doing.
