[SERVER-6869] MapReduce with MongoDB really, really slow (30 hours vs 20 minutes in MySQL for an equivalent database) Created: 27/Aug/12  Updated: 15/Feb/13  Resolved: 27/Aug/12

Status: Closed
Project: Core Server
Component/s: MapReduce
Affects Version/s: 2.2.0-rc0
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: José Manuel Ciges Regueiro Assignee: Ben Becker
Resolution: Done Votes: 0
Labels: mapreduce
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Suse Linux Enterprise Server 10 (Virtual Server on Xen)
RAM: 10 Gb
Cores: 32 (AMD)


Participants:

 Description   

I am now running some data-analysis tests, and in the first, really simple one I got very strange results.

The idea is the following: from an Internet access log (a collection with one document per access; 90 million documents for the tests) I want to get the number of accesses per domain (what would be a GROUP BY in MySQL) and retrieve the 10 most accessed domains.

The script I have written in JavaScript is really simple:

/* Counts each domain url */
m = function () {
    emit(this.domain, 1 );
}
 
r = function (key, values)    {
    var total = 0;
    for (var i = 0; i < values.length; i++)    {
        total += values[i];   // sum the emitted values, not the array indices
    }
    
    return total;
}
 
/* Store of visits per domain statistics on NonFTP_Access_log_domain_visits collection */
res = db.NonFTP_Access_log.mapReduce(m, r, { out: { replace : "NonFTP_Access_log_domain_visits" } } );
db.NonFTP_Access_log_domain_visits.ensureIndex({ "value": 1});
db.NonFTP_Access_log_domain_visits.find({}).sort({ "value":-1 }).limit(10).forEach(printjson);
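As a sanity check outside the mongo shell, the emit/reduce flow above can be simulated in plain JavaScript. This is a hypothetical sketch with invented sample data, not a MongoDB API; it only illustrates why the reduce step must sum the emitted values (`values[i]`) rather than the loop indices:

```javascript
// Minimal simulation of the map/reduce flow in plain Node.js.
// "docs" stands in for the NonFTP_Access_log collection (sample data).
const docs = [
  { domain: "example.com" },
  { domain: "example.com" },
  { domain: "other.org" },
];

// map phase: emit (domain, 1) for each document
const emitted = new Map();
for (const doc of docs) {
  const list = emitted.get(doc.domain) || [];
  list.push(1);
  emitted.set(doc.domain, list);
}

// reduce phase: sum the emitted values for each key
function reduce(key, values) {
  let total = 0;
  for (let i = 0; i < values.length; i++) {
    total += values[i]; // values[i], not i: indices would give wrong counts
  }
  return total;
}

const counts = {};
for (const [domain, values] of emitted) {
  counts[domain] = reduce(domain, values);
}
console.log(counts); // { 'example.com': 2, 'other.org': 1 }
```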

The equivalent in MySQL is:

drop table if exists NonFTP_Access_log_domain_visits;
create table NonFTP_Access_log_domain_visits (
    `domain` varchar(255) NOT NULL,
    `value` int unsigned not null,
    PRIMARY KEY  (`domain`),
    KEY `value_index` (`value`)
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8
    select domain, count(*) as value from NonFTP_Access_log group by domain;
select * from NonFTP_Access_log_domain_visits order by value desc limit 10;

Well, MongoDB takes 30 hours to get the results and MySQL takes 20 minutes! After reading a little I have come to the conclusion that for data analysis we will have to use Hadoop, as MongoDB is really slow.

What am I doing wrong? Are these results normal? Should I use Hadoop?



 Comments   
Comment by Ben Becker [ 27/Aug/12 ]

Hi Jose,

The aggregate command does (roughly) the equivalent of what MySQL does when running a query like this. MapReduce requires processing every document through JavaScript.

At this time the default JavaScript engine (SpiderMonkey) is single-threaded; however, we're planning to replace the default engine with a faster, multithreaded one (V8) in version 2.3/2.4. You can already build v2.0/v2.2 with V8 by passing the --usev8 flag if you would like to test.

It sounds like the server was processing roughly 833 documents/second, which is a bit slow, but not entirely unexpected. I'm going to go ahead and close this issue out, but please feel free to reopen if you have further questions or issues. Also, the mongodb-user group and IRC are good places to get quick feedback if you run into questions/issues like this.

Best Regards,
Ben

Comment by José Manuel Ciges Regueiro [ 27/Aug/12 ]

Apparently, using $group in the Aggregation Framework works well!

The following JavaScript code gets the 10 most visited domains, with their visit counts, in 17m17s!

 
    db.NonFTP_Access_log.aggregate(
        { $group: {
            _id: "$domain",
            visits: { $sum: 1 }
            }},
        { $sort: { visits: -1 } },
        { $limit: 10 }
        ).result.forEach(printjson);
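For readers unfamiliar with the pipeline stages, the $group / $sort / $limit sequence corresponds to the following plain-JavaScript sketch. The sample data is invented for illustration; this is not MongoDB code, just the same grouping logic spelled out:

```javascript
// Plain-JS equivalent of the $group / $sort / $limit pipeline above.
const log = [
  { domain: "a.com" }, { domain: "b.com" },
  { domain: "a.com" }, { domain: "c.com" }, { domain: "a.com" },
];

// $group: { _id: "$domain", visits: { $sum: 1 } }
const visits = {};
for (const { domain } of log) {
  visits[domain] = (visits[domain] || 0) + 1;
}

// $sort: { visits: -1 } followed by $limit: 10
const top10 = Object.entries(visits)
  .map(([domain, n]) => ({ _id: domain, visits: n }))
  .sort((x, y) => y.visits - x.visits)
  .slice(0, 10);

console.log(top10);
```

Unlike MapReduce, each stage here is a single pass over native data structures with no per-document JavaScript callback, which is why the aggregation framework can be so much faster on the server.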

Anyway, I still don't understand why the MapReduce alternative is so slow. Is this a bug, or is this just the way it works for now?

Generated at Thu Feb 08 03:12:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.