I am running some data-analysis tests, and the first one, which is really simple, has produced very strange results.
The idea is the following: starting from an internet access log (a collection with one document per access; 90 million documents for these tests), I want to count the number of accesses per domain (what would be a GROUP BY in MySQL) and get the 10 most accessed domains.
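For reference, this is a minimal sketch of the shape of one document in the collection; only the domain field matters here, the other fields are hypothetical:

    /* Hypothetical shape of one access-log document; only the
       "domain" field is used by the queries below */
    { "domain" : "example.com", "url" : "http://example.com/index.html" }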
The script I wrote in JavaScript is really simple:
    /* Counts each domain url */
    m = function () {
        emit(this.domain, 1);
    };

    /* Sums the emitted counts; "for (var i in values)" yields array
       indices, so the sum must use values[i] rather than i */
    r = function (key, values) {
        var total = 0;
        for (var i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    };

    /* Store per-domain visit statistics in the
       NonFTP_Access_log_domain_visits collection */
    res = db.NonFTP_Access_log.mapReduce(m, r, {
        out: { replace: "NonFTP_Access_log_domain_visits" }
    });
    db.NonFTP_Access_log_domain_visits.ensureIndex({ "value": 1 });
    db.NonFTP_Access_log_domain_visits.find({}).sort({ "value": -1 }).limit(10).forEach(printjson);
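For what it's worth, the same group-and-count can also be written with the aggregation framework (MongoDB 2.2+), which runs without the JavaScript engine that mapReduce uses; a minimal sketch of the same query:

    /* Same per-domain count and top 10, using the aggregation
       framework instead of mapReduce (requires MongoDB 2.2+) */
    db.NonFTP_Access_log.aggregate([
        { $group: { _id: "$domain", value: { $sum: 1 } } },
        { $sort: { value: -1 } },
        { $limit: 10 }
    ]);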
The equivalent in MySQL is:
    drop table if exists NonFTP_Access_log_domain_visits;

    create table NonFTP_Access_log_domain_visits (
        `domain` varchar(255) NOT NULL,
        `value` int unsigned not null,
        PRIMARY KEY (`domain`),
        KEY `value_index` (`value`)
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8
    select domain, count(*) as value
    from NonFTP_Access_log
    group by domain;

    select * from NonFTP_Access_log_domain_visits
    order by value desc
    limit 10;
Well, MongoDB took 30 hours to get the results while MySQL took 20 minutes! After reading a little, I have arrived at the conclusion that for data analysis we will have to use Hadoop, as MongoDB is really slow.
What am I doing wrong? Are these results normal? Should I use Hadoop?