Core Server / SERVER-2976

add a pure JS mode to map reduce to get improved performance for light jobs

    • Type: Improvement
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 1.9.1
    • Affects Version/s: None
    • Component/s: MapReduce
    • Labels: None

      mongod will pick mode based on:

      • by default use pure JS
      • if emit key is object, switch to C++ mode right away
      • if hashmap gets more than 1m keys, switch to C++ mode
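
The selection rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: `pickMode`, `MAX_JS_KEYS`, and the `"js"` / `"cpp"` return values are hypothetical names made up for this note, not mongod's actual code.

```javascript
// Hypothetical sketch of the mode heuristic; names are illustrative only.
const MAX_JS_KEYS = 1000000; // the "1m keys" threshold from the description

// Returns "js" while the job may stay in pure JS mode,
// or "cpp" once it should switch to the C++ path.
function pickMode(emitKey, uniqueKeyCount) {
  // An object key (this includes arrays and dates) cannot be used directly
  // as a key in a plain JS hashmap, so switch right away.
  if (emitKey !== null && typeof emitKey === "object") {
    return "cpp";
  }
  // Too many distinct keys to keep in the JS hashmap.
  if (uniqueKeyCount > MAX_JS_KEYS) {
    return "cpp";
  }
  // Default: stay in pure JS.
  return "js";
}
```

The check would presumably run on each emit, since the key type and the number of unique keys are only known while the job is in progress.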

      On Thu, Apr 21, 2011 at 2:41 PM, Antoine Girbal <antoine@10gen.com> wrote:
      > > That seems reasonable.
      > > But:
      > > 1) it's kind of a pain to switch in the middle of the process from full js to
      > > C++.
      > > 2) there is still the limitation of key being a primitive value
      > >
      > > For #1:
      > > We would have to decide when to break off from JS, dump the map into
      > > C++, then use regular emits and process. This is kind of a pain
      > > code-wise. We could try to base it off the # of input keys as an unreliable
      > > ballpark to choose process at beginning, but that may not be always
      > > known in case there is filter.
      > >
      > > For #2:
      > > I think it's not a big limitation, since in most cases the _id of output
      > > is a string or number. Even if someone wants to use an object as id,
      > > they can make up a unique string from object and carry over any field
      > > they need in the value (tried it, works fine and still fast). If we want
      > > to support full objects as key in JS, we would need to add a hashmap
      > > class and an object "equals" method. An alternative would be to serialize
      > > object to JSON, use that as key, then deserialize later on in C++ (but
      > > some field types cannot be serialized).
      > >
      > > I think it would be good at first to let user choose between modes
      > > depending on use case, since it would solve #1 and #2.
      > > Then once we fully understand the pros/cons of both methods we can add a
      > > heuristic in code.
      > > In any case it's good to have a flag to give a hint on mode to use in
      > > case the heuristic is wrong.
      > >
      > > On Thu, 2011-04-21 at 14:20 -0400, Eliot Horowitz wrote:
      > >> So maybe if keys are less than 1M we do full js.
      > >>
      > >> On Thu, Apr 21, 2011 at 1:20 PM, Antoine Girbal <antoine@10gen.com> wrote:
      > >> > right in this case the # unique keys is 364434
      > >> >
      > >> > On Thu, 2011-04-21 at 12:15 -0400, Eliot Horowitz wrote:
      > >> >> I meant # of keys.
      > >> >> i.e. number of documents in output collection.
      > >> >>
      > >> >> the total # of docs doesn't matter.
      > >> >>
      > >> >> On Thu, Apr 21, 2011 at 12:13 PM, Antoine Girbal <antoine@10gen.com> wrote:
      > >> >> > for roundup.js on my tiny laptop, it was still much faster (over 5x) for
      > >> >> > 1 million docs, and memory usage did not shoot up.
      > >> >> > Where it breaks down may be 10m or more, but it may also depend on some
      > >> >> > other factors.
      > >> >> > Also that's where we may see some big diff between v8 and SM since the
      > >> >> > object implementation matters a lot.
      > >> >> > To save memory we can easily make code reduce each key when they get too
      > >> >> > big, but we would not dump to temp collection.
      > >> >> > I think if our goal is to make MR 5x faster or more so that people
      > >> >> > consider it a viable option, we need to provide the "pure JS" execution
      > >> >> > process.
      > >> >> > We could make it an option within the MR options, like "fast" or
      > >> >> > "pureJS".
      > >> >> > I know we don't want new options but then people could choose pureJS for
      > >> >> > quick jobs and the standard for real heavy duty stuff.
      > >> >> > Also in pureJS mode the emit key would have to be a value not an object.
      > >> >> >
      > >> >> > going to start committing v8 stuff but without that mode the improvement
      > >> >> > will not be significant (about 20% improvement).
      > >> >> > AG
      > >> >> >
      > >> >> > On Wed, 2011-04-20 at 19:33 -0400, Eliot Horowitz wrote:
      > >> >> >> Yes.
      > >> >> >> For small things it worked well.
      > >> >> >> The problem was with large amounts of keys you can't fit it all in js,
      > >> >> >> which then makes it a lot more expensive.
      > >> >> >> If we could determine how many keys there are...
      > >> >> >>
      > >> >> >> So we could start all in JS, and once you hit 1000 unique keys, you
      > >> >> >> fall back to current method...
      > >> >> >> Maybe it's really 100k, not sure
      > >> >> >>
      > >> >> >> On Wed, Apr 20, 2011 at 7:26 PM, Antoine Girbal <antoine@10gen.com> wrote:
      > >> >> >> > so 1 way to get massive improvement with map/reduce, about 5x, is to
      > >> >> >> > keep the data in js during the whole MR.
      > >> >> >> > Right now we do:
      > >> >> >> > - map converts all objects from bson to JS
      > >> >> >> > - emit is called, converts objects from JS to bson, inserted in C++ map.
      > >> >> >> > - reduce is called, objects converted from bson to JS
      > >> >> >> > - then result of reduce converted from JS to bson and output
      > >> >> >> > Basically there is a lot of back and forth JS/BSON, and many objects are
      > >> >> >> > created / translated.
      > >> >> >> >
      > >> >> >> > I did an experiment with the following:
      > >> >> >> > - map converts all objects from bson to JS
      > >> >> >> > - emit just stores object in JS map
      > >> >> >> > - reduce is called, objects get reduced in JS
      > >> >> >> > - then result of reduce converted from JS to bson and output
      > >> >> >> >
      > >> >> >> > for mr1.js this brings execution from 1sec to 200ms.
      > >> >> >> > for roundup.js it brings emit time from 40s to 15s.
      > >> >> >> > If I modify roundup to emit a string instead of object for key, it gets it
      > >> >> >> > down to 20s instead of 60s.
      > >> >> >> >
      > >> >> >> > There are problems though:
      > >> >> >> > - works if keys are strings/numbers, but doesn't work if key is an object.
      > >> >> >> > This would require a real js hashmap and object comparison (not native
      > >> >> >> > to js).
      > >> >> >> > - potentially more memory consumption at 1 point in time (but much less
      > >> >> >> > churning of objects overall).
      > >> >> >> >
      > >> >> >> > did you guys look at this solution?
      > >> >> >> > AG
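
The experiment described in the thread (emit stores values in a JS map, reduce runs in JS, and only the final result is converted) might look roughly like the sketch below. It is a minimal illustration, not the committed implementation: `pureJsMapReduce` is a hypothetical helper, `emit` is passed to `map` as an argument rather than being a global as in the server, a built-in `Map` stands in for the JS hashmap, and the BSON conversion steps are left out entirely.

```javascript
// Hypothetical sketch: run map and reduce entirely in JS.
function pureJsMapReduce(docs, map, reduce) {
  const groups = new Map(); // emit key -> array of emitted values

  // emit only stores the value in the JS map; nothing is converted to BSON.
  function emit(key, value) {
    if (key !== null && typeof key === "object") {
      // Object keys would need a real hashmap with object comparison.
      throw new Error("pure JS mode needs a primitive emit key");
    }
    const values = groups.get(key);
    if (values === undefined) {
      groups.set(key, [value]);
    } else {
      values.push(value);
    }
  }

  // Map phase: each document is bound to `this`, as in a MongoDB map function.
  for (const doc of docs) {
    map.call(doc, emit);
  }

  // Reduce phase: keys with a single value are passed through without
  // calling reduce.
  const output = [];
  for (const [key, values] of groups) {
    const value = values.length === 1 ? values[0] : reduce(key, values);
    output.push({ _id: key, value: value });
  }
  return output; // in the server, this is where JS would be converted to BSON
}

// Usage: sum `n` per user.
const docs = [
  { user: "a", n: 1 },
  { user: "b", n: 2 },
  { user: "a", n: 3 },
];
const result = pureJsMapReduce(
  docs,
  function (emit) { emit(this.user, this.n); },
  function (key, values) { return values.reduce((sum, v) => sum + v, 0); }
);
console.log(result);
```

A job that wants an object as its key could, as suggested in the thread, build a unique string from the object's fields and carry the fields it needs in the value.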

            Assignee: Antoine Girbal
            Reporter: Antoine Girbal
            Votes: 2
            Watchers: 3