Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-6869

MapReduce with MongoDB really, really slow (30 hours vs 20 minutes in MySQL for a equivalent database)

    • Type: Icon: Question Question
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 2.2.0-rc0
    • Component/s: MapReduce
    • Environment:
      Operating System: Suse Linux Enterprise Server 10 (Virtual Server on Xen)
      RAM: 10 Gb
      Cores: 32 (AMD)

      I am doing now some data analyze tests and in the first, really simple I have got very strange results.

      The idea is the following: from an internet access log (a collection with a document for each access, for the tests 90 millions of documents). I want to get the number of access by domain (what will be a GROUP BY in MySQL), and get the 10 most accessed domains

      The script I have made in JavaScript is really simple :

      /* Counts each domain url */
      m = function () {
          emit(this.domain, 1 );
      }
      
      r = function (key, values)    {
          total = 0;
          for (var i in values)    {
              total += Number(i);
          }
          
          return total;
      }
      
      /* Store of visits per domain statistics on NonFTP_Access_log_domain_visits collection */
      res = db.NonFTP_Access_log.mapReduce(m, r, { out: { replace : "NonFTP_Access_log_domain_visits" } } );
      db.NonFTP_Access_log_domain_visits.ensureIndex({ "value": 1});
      db.NonFTP_Access_log_domain_visits.find({}).sort({ "value":-1 }).limit(10).forEach(printjson);
      

      The equivalent in MySQL is :

      drop table if exists NonFTP_Access_log_domain_visits;
      create table NonFTP_Access_log_domain_visits (
          `domain` varchar(255) NOT NULL,
          `value` int unsigned not null,
          PRIMARY KEY  (`domain`),
          KEY `value_index` (`value`)
          ) ENGINE=MyISAM DEFAULT CHARSET=utf8
          select domain, count(*) as value from NonFTP_Access_log group by domain;
      select * from NonFTP_Access_log_domain_visits order by value desc limit 10;
      

      Well, MongoDB takes 30 hours to get the results and MySQL 20 minutes! After reading a little I have arrived to the conclusion that for data analyze we will have to use Hadoop as MongoDB is really slow.

      What am I doing wrong? Are this results normal? Should I use Hadoop?

            Assignee:
            benjamin.becker Ben Becker
            Reporter:
            ciges José Manuel Ciges Regueiro
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated:
              Resolved: