Core Server / SERVER-1197

Performance question regarding map/reduce: MongoDB map/reduce function slower than naive (Python) counting

    Details

    • Type: Question
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Works as Designed
    • Affects Version/s: 1.5.2
    • Fix Version/s: None
    • Component/s: Performance
    • Labels:
      None
    • Environment:
      GNU/Linux Ubuntu (Lucid) mongodb-unstable package (version 20100604)
    • Backport:
      No
    • # Replies:
      6
    • Last comment by Customer:
      false

      Description

      We coded the given map/reduce example (http://api.mongodb.org/python/current/examples/map_reduce.html) directly in Python and got much better performance (see attached script) ... did we get something wrong?

      Usage of the script (will create sample data and time the two methods):

      python mongo_map_reduce_counter.py test_db_name

      Example output (with nb_objects=5000, nb_tags=200, nb_bins=3):
      $>python mongo_map_reduce_counter.py test_sdsd
      calc naive time 0.317932844162
      calc map_reduce time 110.605533838
      Same results? True
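      Since the attached script is not reproduced here, the following is a minimal sketch of the two approaches being timed, reconstructed from the linked PyMongo example rather than from the script itself. The database, collection (things), and output names are illustrative, and it uses the PyMongo API of that era (Connection, Collection.map_reduce), both of which were later removed in PyMongo 4:

          from collections import defaultdict
          from bson.code import Code
          from pymongo import Connection  # PyMongo 1.x-era client (illustrative, era-appropriate)

          db = Connection().test_db_name  # illustrative database name

          # Naive approach: pull each document's tags client-side, count in Python.
          def naive_count():
              counts = defaultdict(int)
              for doc in db.things.find({}, ["tags"]):
                  for tag in doc.get("tags", []):
                      counts[tag] += 1
              return dict(counts)

          # Server-side map/reduce, as in the linked PyMongo example:
          # emit (tag, 1) for every tag, then sum the 1s in reduce.
          def map_reduce_count():
              map_fn = Code("function () {"
                            "  this.tags.forEach(function (z) { emit(z, 1); });"
                            "}")
              reduce_fn = Code("function (key, values) {"
                               "  var total = 0;"
                               "  for (var i = 0; i < values.length; i++) total += values[i];"
                               "  return total;"
                               "}")
              out = db.things.map_reduce(map_fn, reduce_fn, out="tag_counts")
              return dict((d["_id"], d["value"]) for d in out.find())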

        Activity

        Eliot Horowitz added a comment -

        Python may be faster than map/reduce in some cases.
        We are going to be working on m/r performance later this year.

        Alan added a comment -

        Thanks for your answer. In this particular (and, judging by the source code, very simple) example the speed difference is really large! I guess we will stick to the "naive" Python implementation for now.

        Stephen Nelson added a comment -

        Why was this issue closed without the problem being corrected? I'm using MongoDB (version 2.0) for my dissertation research. A MongoDB map/reduce function runs between 10 and 100 times slower than a Java implementation that does the same thing.

        To test this I hand-coded a naive Java implementation of map/reduce, which maps by iterating over a collection, performing the map operation, and storing any emits in a temporary collection. I then create an index on the temporary collection and call reduce, which iterates over the temporary collection finding keys, retrieves all entries for each key in batches, stores the result in a new collection, and deletes all entries for that key before moving on. When the temporary collection is empty, I'm done.

        This naive approach took my map/reduce function, operating on several hundred million documents, from many days down to hours. I've since written an implementation which uses a cache in Java and a sequential traversal of the temporary collection without deletes, which brings it down by another factor of 10.
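        A rough Python rendering of the client-driven scheme described above (the original was Java; collection names are illustrative, and map_fn/reduce_fn stand in for the user's map and reduce callables):

            # External map/reduce driven from the client, per the description above.
            # map_fn(doc) yields (key, value) pairs; reduce_fn(key, values) returns one value.
            def external_map_reduce(db, map_fn, reduce_fn):
                tmp = db.mr_tmp              # temporary collection holding the emits
                tmp.drop()
                # Map phase: apply map_fn to every document, store each emit.
                for doc in db.source.find():
                    for key, value in map_fn(doc):
                        tmp.insert({"k": key, "v": value})
                tmp.create_index("k")        # index so per-key retrieval is cheap
                # Reduce phase: pick any remaining key, reduce all of its entries,
                # store the result, delete the entries; done when tmp is empty.
                while True:
                    entry = tmp.find_one()
                    if entry is None:
                        break
                    key = entry["k"]
                    values = [e["v"] for e in tmp.find({"k": key})]
                    db.mr_out.insert({"_id": key, "value": reduce_fn(key, values)})
                    tmp.remove({"k": key})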

        Why is mongo's implementation so slow? Are you loading an entirely new JavaScript VM for every application of map? Map/reduce's performance is completely at odds with the excellent performance of everything else.

        Eliot Horowitz added a comment -

        JavaScript is much slower than Java.
        If it comes down to that, Java will always win.
        The new aggregation framework (SERVER-447) is the long-term solution.
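        For the tag-counting job from the description, the aggregation framework (which shipped in MongoDB 2.2) expresses the whole computation as a pipeline executed in native code, with no JavaScript at all. A minimal sketch, reusing the illustrative things collection from above:

            # Unwind the tags array into one document per tag, then group and count.
            pipeline = [
                {"$unwind": "$tags"},
                {"$group": {"_id": "$tags", "count": {"$sum": 1}}},
            ]
            results = db.things.aggregate(pipeline)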

        Stephen Nelson added a comment -

        http://shootout.alioth.debian.org/u32/javascript.php

        Nice try, but JavaScript (V8) is only about 3x slower than Java for the type of operations I'm performing. Mongo specifically is adding truly massive overhead to map/reduce operations.

        The new aggregation framework is not adequate for the things I'm doing (and, I'm sure, for many other users): my map functions need to make decisions about documents based on non-trivial dependent properties. The aggregation framework will not be able to replace m/r; it's not sufficient, as you seem to be aware from your comments on the linked issue.

        Has anyone profiled mongo's map/reduce implementation to determine where the overhead is coming from?

        Eliot Horowitz added a comment -

        If you want to diagnose why your case is slow, can you open a new ticket with the map/reduce code and sample data?


          People

          • Votes: 0
          • Watchers: 5

            Dates

            • Days since reply: 2 years, 19 weeks, 1 day