<!-- 
RSS generated by JIRA (9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66) at Thu Feb 08 03:01:57 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>MongoDB Jira</title>
    <link>https://jira.mongodb.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.7.1</version>
        <build-number>970001</build-number>
        <build-date>13-04-2023</build-date>
    </build-info>


<item>
            <title>[SERVER-3055] MapReduce Performance very slow compared to Hadoop</title>
                <link>https://jira.mongodb.org/browse/SERVER-3055</link>
                <project id="10000" key="SERVER">Core Server</project>
                    <description>&lt;p&gt;I have run into a dilemma with MongoDB.  We have been performing&lt;br/&gt;
some MapReduce benchmarks against Hadoop and have found MongoDB&lt;br/&gt;
to be a lot slower than Hadoop: 65 minutes vs 2 minutes for a CPU-intensive&lt;br/&gt;
MapReduce job that basically breaks up strings and computes word counts&lt;br/&gt;
on a large number of email texts (about 974 MB worth).  I sharded the collection&lt;br/&gt;
across 3 servers and verified that it did get evenly distributed after using&lt;br/&gt;
db.printShardingStatus(); there are 7/8/7 chunks on the 3 shards.&lt;br/&gt;
And the collection is indexed.&lt;/p&gt;

&lt;p&gt;Basically we have a couple of questions:&lt;/p&gt;

&lt;p&gt;    Is there any alternative to using JavaScript for the Map and Reduce functions from the Java API?  We think that the JavaScript may be slowing things down a lot.&lt;br/&gt;
    Are there other overhead threads running that can be or should be disabled to speed up the MapReduce performance?&lt;/p&gt;

&lt;p&gt;It just seems that this should execute a lot faster.&lt;/p&gt;

&lt;p&gt;Thank you for any help,&lt;br/&gt;
Jim Olson&lt;/p&gt;

&lt;p&gt;Kyle Banker&apos;s response to this was:&lt;/p&gt;

&lt;p&gt;&quot;These results aren&apos;t surprising. You&apos;re right that the JavaScript&lt;br/&gt;
engine is slow (and single-threaded). We&apos;re upgrading to V8, which may&lt;br/&gt;
help somewhat, but it still won&apos;t be as fast as, say, Hadoop.&lt;/p&gt;

&lt;p&gt;MongoDB 2.0 will have a different, improved aggregation framework that&lt;br/&gt;
doesn&apos;t use JS. That will greatly improve aggregation for a lot of use&lt;br/&gt;
cases. I&apos;d recommend that you create a JIRA issue for this use case so&lt;br/&gt;
that we can track interest and make sure that the new framework can&lt;br/&gt;
support it.&quot;&lt;/p&gt;

&lt;p&gt;So this is my JIRA ticket.&lt;br/&gt;
Please let me know if I can provide further details.&lt;br/&gt;
Thank you.    jamesolson@noviidesign.com&lt;/p&gt;
</description>
                <environment>Linux</environment>
        <key id="16711">SERVER-3055</key>
            <summary>MapReduce Performance very slow compared to Hadoop</summary>
                <type id="4" iconUrl="https://jira.mongodb.org/secure/viewavatar?size=xsmall&amp;avatarId=14710&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="3" iconUrl="https://jira.mongodb.org/images/icons/priorities/major.svg">Major - P3</priority>
                        <status id="6" iconUrl="https://jira.mongodb.org/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="antoine">Antoine Girbal</assignee>
                                    <reporter username="jimo555">Jim Olson</reporter>
                        <labels>
                    </labels>
                <created>Fri, 6 May 2011 14:19:19 +0000</created>
                <updated>Wed, 29 Feb 2012 03:54:03 +0000</updated>
                            <resolved>Fri, 2 Dec 2011 07:29:18 +0000</resolved>
                                    <version>1.8.0</version>
                                                    <component>JavaScript</component>
                                        <votes>5</votes>
                                    <watches>6</watches>
                                                                                                                <comments>
                            <comment id="70365" author="antoine" created="Fri, 2 Dec 2011 07:29:18 +0000"  >&lt;p&gt;The switch to V8 will improve the speed (it should be 2-3x faster), so I am marking this as a duplicate. Please reopen if you have more questions, or post on the mongodb-user group for help troubleshooting MapReduce.&lt;/p&gt;</comment>
                            <comment id="59959" author="antoine" created="Wed, 12 Oct 2011 04:05:45 +0000"  >&lt;p&gt;Jim,&lt;br/&gt;
did you have a chance to do further testing with your map/reduce?&lt;br/&gt;
If you are still able to test this with MongoDB, there are several improvements in 2.0 that can make it perform faster.&lt;br/&gt;
These include:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;less disk I/O on regular jobs&lt;/li&gt;
	&lt;li&gt;a mode where everything stays in JavaScript with no disk I/O&lt;/li&gt;
	&lt;li&gt;better V8 support&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What is the data size you are computing?&lt;br/&gt;
Thanks&lt;/p&gt;
</comment>
                            <comment id="31841" author="jimo555" created="Tue, 10 May 2011 11:13:59 +0000"  >&lt;p&gt;I got the bigger job to complete by adding a line of JavaScript to the Map function to only emit if the word length is greater than 1.  The job then completed in 20.3 minutes.  But why were the exceptions occurring above?  It was counting an additional 37 words (a-z, 0-9, and the empty string) and the empty string had about 13 M occurrences.&lt;br/&gt;
It concerns me that the exceptions above killed the MapReduce jobs.  Thanks&lt;/p&gt;</comment>
                            <comment id="31696" author="jimo555" created="Mon, 9 May 2011 20:31:54 +0000"  >&lt;p&gt;The same abend happened the 2nd time.  The 261277 is not the number of distinct words (it is 167258) &amp;#8211; I don&apos;t know what the 261277 represents.  I can see from the first (successful shorter) job that the maximum datum was for an empty string which had a count of 13205099 which when multiplied by 10 (for the big set) would yield 132050990 or 7DEF02E hex.  I was thinking the count might have exceeded the range of an integer but it shouldn&apos;t.  &lt;/p&gt;</comment>
                            <comment id="31687" author="jimo555" created="Mon, 9 May 2011 20:06:45 +0000"  >&lt;p&gt;The big job abended with an odd error:&lt;br/&gt;
com.mongodb.CommandResult$CommandFailure: command failed [command failed &lt;span class=&quot;error&quot;&gt;&amp;#91;mapreduce&amp;#93;&lt;/span&gt; {&quot;cause&quot; : {&quot;assertion&quot; : &quot;Invalid BSONObj size: 18597080 (0xD8C41B01) first element: 0: 261677&quot; , &quot;assertionCode&quot; : 10334 , &quot;errmsg&quot; : &quot;db assertion failure&quot;, &quot;ok&quot; : 0.0, &quot;errmsg&quot; : &quot;mongo mr failed: &lt;/p&gt;
{ assertion: \&quot;Invalid BSONObj size: 18597080 (0xD8C41B01) first element: 0: 261277.0\&quot;, assertionCode : 10334, errmsg: \&quot;db assertion failure\&quot;, ok: 0.0 }
&lt;p&gt;&quot;}&lt;br/&gt;
at com.mongodb.CommandResult.getException(CommandResult.java:69)&lt;br/&gt;
at com.mongodb.CommandResult.throwOnError(CommandResult.java:79)&lt;br/&gt;
at com.mongodb.DBCollection.mapReduce(DBCollection.java:961)&lt;br/&gt;
at my code where it invoked the mapReduce job on the collection.&lt;/p&gt;
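&lt;p&gt;One possible reading of the two sizes in the assertion above, offered as an assumption rather than a confirmed diagnosis: 18597080 and 0xD8C41B01 appear to be the same 32-bit value read in opposite byte orders. The sketch below checks that arithmetic in plain JavaScript; the helper name swapByteOrder32 is hypothetical.&lt;/p&gt;

```javascript
// Check whether the decimal and hex sizes from the assertion message are the
// same 32-bit value read in opposite byte orders. Standard JavaScript only.
// swapByteOrder32 is a hypothetical helper, not part of any MongoDB API.
function swapByteOrder32(value) {
    const view = new DataView(new ArrayBuffer(4));
    view.setUint32(0, value, true);     // write the value as little-endian
    return view.getUint32(0, false);    // read the same bytes as big-endian
}

const decimalSize = 18597080;           // decimal size from the assertion
const hexSize = 0xD8C41B01;             // hex size from the assertion

console.log(decimalSize.toString(16));                    // 11bc4d8
console.log(swapByteOrder32(decimalSize).toString(16));   // d8c41b01
console.log(swapByteOrder32(decimalSize) === hexSize);    // true
```

&lt;p&gt;If that reading is right, the two numbers describe one size, 18597080 bytes, and the mismatch is only in how the bytes were printed.&lt;/p&gt;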

&lt;p&gt;What is confusing to me is the decimal and hex sizes differ.  The 261677 is the number of distinct words in the data set.  The data is a 974 MB collection of email texts spread over 3 sharded servers. &lt;/p&gt;</comment>
                            <comment id="31660" author="eliot" created="Mon, 9 May 2011 19:08:06 +0000"  >&lt;p&gt;Should also be a lot faster in 2.0, so should see what happens there.&lt;/p&gt;</comment>
                            <comment id="31658" author="jimo555" created="Mon, 9 May 2011 18:56:07 +0000"  >&lt;p&gt;Eliot, thanks.  I tried this and it works.  It&apos;s about 50% faster.&lt;br/&gt;
It took about 3 minutes 18 seconds on 10% of the original data set,&lt;br/&gt;
so extrapolating that out it would be 33.1 minutes, roughly half the&lt;br/&gt;
time of the original run.  Definitely a marked improvement, but still&lt;br/&gt;
slow compared to Hadoop.  I will run the full job again just to see&lt;br/&gt;
if it differs from this extrapolation and let you know.  &lt;/p&gt;</comment>
                            <comment id="31574" author="eliot" created="Mon, 9 May 2011 12:54:24 +0000"  >&lt;p&gt;It looks like you basically wrote your own map/reduce engine inside of the map/reduce engine.&lt;/p&gt;

&lt;p&gt;Try this&lt;/p&gt;



&lt;p&gt;function () { &lt;br/&gt;
    var b = this.body.toLowerCase(); &lt;br/&gt;
    var re = /]|\\u005c|\\u000d|[- \t,.&amp;lt;&amp;gt;()[{}/?!|*&amp;#39;\&amp;quot;`~+=_&amp;amp;^%;:#@$]/; &lt;br/&gt;
    var arr = b.split(re); &lt;br/&gt;
    for (var i = 0; i &amp;lt; arr.length; i++) { &lt;br/&gt;
        var word = arr[i]; &lt;br/&gt;
        emit( word , 1 ); &lt;br/&gt;
    } &lt;br/&gt;
}; &lt;/p&gt;


&lt;p&gt;The reduce function is: &lt;/p&gt;

&lt;p&gt;function(key, values) { &lt;br/&gt;
    return Array.sum( values );&lt;br/&gt;
}; &lt;/p&gt;
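
&lt;p&gt;To try the emit-per-word pattern above outside MongoDB, here is a minimal sketch in plain JavaScript. It is an illustration under assumptions, not the server implementation: runWordCount, the local emit, the sample input, and the simplified separator pattern are all hypothetical, and a plain array reduce stands in for Array.sum. It also skips empty and single-character words.&lt;/p&gt;

```javascript
// Plain-JavaScript simulation of the emit(word, 1) word-count pattern.
// Everything here (runWordCount, the local emit, the separator regex) is a
// hypothetical stand-in for illustration; it does not call MongoDB.
function runWordCount(docs) {
    const emitted = new Map();                  // key -> list of emitted values

    function emit(key, value) {
        if (!emitted.has(key)) emitted.set(key, []);
        emitted.get(key).push(value);
    }

    function map() {
        // Simplified separator (assumption): any run of characters other
        // than a-z and 0-9 splits words.
        const words = this.body.toLowerCase().split(/[^a-z0-9]+/);
        for (const word of words) {
            // Skip empty strings and single characters.
            if (word.length === 0 || word.length === 1) continue;
            emit(word, 1);
        }
    }

    function reduce(key, values) {
        // Sum the emitted counts for one key.
        return values.reduce(function (a, b) { return a + b; }, 0);
    }

    for (const doc of docs) map.call(doc);      // bind each document as `this`

    const counts = {};
    for (const [key, values] of emitted) counts[key] = reduce(key, values);
    return counts;
}

const counts = runWordCount([
    { body: "Hello, hello world!" },
    { body: "a World of words" }
]);
console.log(counts.hello, counts.world, counts.of, counts.words);   // 2 2 1 1
```

&lt;p&gt;Emitting one small value per word keeps every reduce input tiny, instead of passing one large hash per document through a single key.&lt;/p&gt;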

</comment>
                            <comment id="31572" author="jimo555" created="Mon, 9 May 2011 12:49:11 +0000"  >&lt;p&gt;Eliot,&lt;br/&gt;
The map function is:&lt;/p&gt;

&lt;p&gt;function () {&lt;br/&gt;
    var wordsHash = {};&lt;br/&gt;
    var b = this.body.toLowerCase();&lt;br/&gt;
    var re = /]|\\u005c|\\u000d|[- \t,.&amp;lt;&amp;gt;()[{}/?!|*&amp;#39;\&amp;quot;`~+=_&amp;amp;^%;:#@$]/;&lt;br/&gt;
    var arr = b.split(re);&lt;br/&gt;
    for (var i = 0; i &amp;lt; arr.length; i++) {&lt;br/&gt;
        var word = arr[i];&lt;br/&gt;
        if (word.length &amp;gt; 1) {&lt;br/&gt;
            if (wordsHash[word])&lt;br/&gt;
                wordsHash[word] += 1;&lt;br/&gt;
            else&lt;br/&gt;
                wordsHash[word] = 1;&lt;br/&gt;
        }&lt;br/&gt;
    }&lt;br/&gt;
    emit(&quot;mr&quot;, wordsHash);&lt;br/&gt;
};&lt;/p&gt;


&lt;p&gt;The reduce function is:&lt;/p&gt;

&lt;p&gt;function(key, values) {&lt;br/&gt;
    var wordHashTotals = {};&lt;br/&gt;
    for (var i = 0; i &amp;lt; values.length; i++) {&lt;br/&gt;
        var wordHash = values[i];&lt;br/&gt;
        for (var word in wordHash) {&lt;br/&gt;
            var wordCount = wordHash[word];&lt;br/&gt;
            if (wordHashTotals[word])&lt;br/&gt;
                wordHashTotals[word] += wordCount;&lt;br/&gt;
            else&lt;br/&gt;
                wordHashTotals[word] = wordCount;&lt;br/&gt;
        }&lt;br/&gt;
    }&lt;br/&gt;
    return wordHashTotals;&lt;br/&gt;
};&lt;/p&gt;

&lt;p&gt;The rest is just a simple Java program that creates a MapReduceCommand object on&lt;br/&gt;
the data collection and then submits the job.&lt;/p&gt;

&lt;p&gt;Hope this helps.  I looked it over and I don&apos;t think there are any typos.&lt;br/&gt;
I do get correct results with it; it just takes a long time.&lt;br/&gt;
I had to split the regexp into 3 alternatives because having the ], backslash, and \n characters&lt;br/&gt;
in the big character class caused Java syntax errors.&lt;/p&gt;

&lt;p&gt;Regards,&lt;br/&gt;
Jim Olson&lt;/p&gt;
</comment>
                            <comment id="31541" author="eliot" created="Mon, 9 May 2011 05:50:50 +0000"  >&lt;p&gt;First, one option is to use Hadoop for processing, with the data input and output in MongoDB.  &lt;br/&gt;
See: &lt;a href=&quot;https://github.com/mongodb/mongo-hadoop&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/mongodb/mongo-hadoop&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2nd, can you send the code you&apos;re using?&lt;br/&gt;
There are definitely ways it can be optimized.&lt;/p&gt;

&lt;p&gt;Also, the new aggregation framework might make things much faster.&lt;br/&gt;
It all depends on exactly what you&apos;re doing.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                <customfield id="customfield_10050" key="com.atlassian.jira.toolkit:comments">
                        <customfieldname># Replies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                <customfield id="customfield_10055" key="com.atlassian.jira.ext.charting:firstresponsedate">
                        <customfieldname>Date of 1st Reply</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Mon, 9 May 2011 05:50:50 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10052" key="com.atlassian.jira.toolkit:dayslastcommented">
                        <customfieldname>Days since reply</customfieldname>
                        <customfieldvalues>
                                        12 years, 11 weeks, 5 days ago
    
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_18254" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Dependencies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue><![CDATA[]]></customfieldvalue>


                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_15850" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10057" key="com.atlassian.jira.toolkit:lastusercommented">
                        <customfieldname>Last comment by Customer</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>true</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10056" key="com.atlassian.jira.toolkit:lastupdaterorcommenter">
                        <customfieldname>Last commenter</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>ian@mongodb.com</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_11151" key="com.atlassian.jira.toolkit:LastCommentDate">
                        <customfieldname>Last public comment date</customfieldname>
                        <customfieldvalues>
                            12 years, 11 weeks, 5 days ago
                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10000" key="com.atlassian.jira.plugin.system.customfieldtypes:radiobuttons">
                        <customfieldname>Old_Backport</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10000"><![CDATA[No]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                <customfield id="customfield_10051" key="com.atlassian.jira.toolkit:participants">
                        <customfieldname>Participants</customfieldname>
                        <customfieldvalues>
                                        <customfieldvalue>antoine</customfieldvalue>
            <customfieldvalue>eliot</customfieldvalue>
            <customfieldvalue>jimo555</customfieldvalue>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_14254" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Product Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hrp09j:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                <customfield id="customfield_12550" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>2|hrifaf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10558" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>21129</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_23361" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Requested By</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                <customfield id="customfield_10053" key="com.atlassian.jira.ext.charting:timeinstatus">
                        <customfieldname>Time In Status</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_22870" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Triagers</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                    <customfield id="customfield_14350" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>serverRank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hs9wlz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                    </customfields>
    </item>
</channel>
</rss>