I am running some data-analysis tests now, and in the first, really simple one, I have got very strange results.
The idea is the following: from an internet access log (a collection with one document per access; 90 million documents for the tests), I want to get the number of accesses per domain (what would be a GROUP BY in MySQL), and get the 10 most accessed domains.
The script I have written in JavaScript is really simple:
/* Counts each domain url */
m = function () {
    emit(this.domain, 1);
}

r = function (key, values) {
    var total = 0;
    for (var i in values) {
        total += values[i];
    }
    return total;
}

/* Store of visits per domain statistics on NonFTP_Access_log_domain_visits collection */
res = db.NonFTP_Access_log.mapReduce(m, r, { out: { replace: "NonFTP_Access_log_domain_visits" } });
db.NonFTP_Access_log_domain_visits.ensureIndex({ "value": 1 });
db.NonFTP_Access_log_domain_visits.find({}).sort({ "value": -1 }).limit(10).forEach(printjson);
|
The equivalent in MySQL is:
drop table if exists NonFTP_Access_log_domain_visits;

create table NonFTP_Access_log_domain_visits (
    `domain` varchar(255) NOT NULL,
    `value` int unsigned NOT NULL,
    PRIMARY KEY (`domain`),
    KEY `value_index` (`value`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

insert into NonFTP_Access_log_domain_visits
    select domain, count(*) as value from NonFTP_Access_log group by domain;

select * from NonFTP_Access_log_domain_visits order by value desc limit 10;
|
Well, MongoDB takes 30 hours to get the results and MySQL 20 minutes! After reading a little, I have come to the conclusion that for data analysis we will have to use Hadoop, as MongoDB is really slow.
What am I doing wrong? Are these results normal? Should I use Hadoop?