mongod will pick the mode based on the following, roughly as sketched below:
- by default, use pure JS
- if the emit key is an object, switch to C++ mode right away
- if the hashmap gets more than 1M keys, switch to C++ mode
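Roughly, the emit path would look something like this (just a sketch; helper names like switchToCppMode/cppEmit are placeholders, not actual server code):

    // sketch only -- illustrates the proposed heuristic, not real mongod code
    var JS_KEY_LIMIT = 1000000;   // ~1M distinct keys before falling back to C++
    var mode = "js";              // default: pure JS mode
    var jsMap = {};
    var numKeys = 0;

    function emit(key, value) {
        if (mode === "js") {
            if (key !== null && typeof key === "object") {
                switchToCppMode();                 // object key: go to C++ right away
            } else {
                var k = String(key);
                if (!jsMap.hasOwnProperty(k)) numKeys++;
                (jsMap[k] = jsMap[k] || []).push(value);
                if (numKeys > JS_KEY_LIMIT) switchToCppMode();  // too many keys for JS
                return;
            }
        }
        cppEmit(key, value);      // placeholder for the existing JS->BSON->C++ insert
    }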
On Thu, Apr 21, 2011 at 2:41 PM, Antoine Girbal <antoine@10gen.com> wrote:
> > That seems reasonable.
> > But:
> > 1) it's kind of a pain to switch in the middle of the process from
> > full JS to C++.
> > 2) there is still the limitation of the key having to be a primitive
> > value.
> >
> > For #1:
> > We would have to decide when to break off from JS, dump the map into
> > C++, then use regular emits and processing. This is kind of a pain
> > code-wise. We could try to base it off the # of input keys as an
> > unreliable ballpark to choose the process at the beginning, but that
> > may not always be known when there is a filter.
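> > Roughly, the break-off would look something like this (sketch only,
> > names made up; jsMap/cppEmit/mode as in the emit sketch above):
> >
> >     // dump the in-JS map into the existing C++ emit path, then stay
> >     // in C++ mode for the rest of the job
> >     function switchToCppMode() {
> >         for (var k in jsMap) {
> >             if (jsMap.hasOwnProperty(k)) {
> >                 cppEmit(k, jsMap[k]);   // cppEmit = existing JS->BSON insert
> >             }
> >         }
> >         jsMap = {};                     // free the JS-side memory
> >         mode = "cpp";                   // further emits go straight to C++
> >     }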
> >
> > For #2:
> > I think it's not a big limitation, since in most cases the _id of the
> > output is a string or number. Even if someone wants to use an object
> > as the id, they can make up a unique string from the object and carry
> > over any field they need in the value (tried it, works fine and is
> > still fast). If we want to support full objects as keys in JS, we
> > would need to add a hashmap class and an object "equals" method. An
> > alternative would be to serialize the object to JSON, use that as the
> > key, then deserialize it later on in C++ (but some field types cannot
> > be serialized).
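> > E.g. with made-up fields state/city, instead of emitting the object as
> > the key:
> >
> >     // workaround sketch: build a unique string key and carry the
> >     // fields over in the value
> >     var mapFn = function() {
> >         // instead of emit({ state: this.state, city: this.city }, { count: 1 })
> >         emit(this.state + "|" + this.city,
> >              { state: this.state, city: this.city, count: 1 });
> >     };
> >     var reduceFn = function(key, values) {
> >         var out = { state: values[0].state, city: values[0].city, count: 0 };
> >         values.forEach(function(v) { out.count += v.count; });
> >         return out;
> >     };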
> >
> > I think it would be good at first to let the user choose between the
> > modes depending on the use case, since that would solve #1 and #2.
> > Then, once we fully understand the pros/cons of both methods, we can
> > add a heuristic in the code.
> > In any case it's good to have a flag that gives a hint on which mode
> > to use in case the heuristic is wrong.
> >
> > On Thu, 2011-04-21 at 14:20 -0400, Eliot Horowitz wrote:
> >> So maybe if there are fewer than 1M keys we do full JS.
> >>
> >> On Thu, Apr 21, 2011 at 1:20 PM, Antoine Girbal <antoine@10gen.com> wrote:
> >> > right, in this case the # of unique keys is 364434
> >> >
> >> > On Thu, 2011-04-21 at 12:15 -0400, Eliot Horowitz wrote:
> >> >> I meant the # of keys,
> >> >> i.e. the number of documents in the output collection.
> >> >>
> >> >> the total # of docs doesn't matter.
> >> >>
> >> >> On Thu, Apr 21, 2011 at 12:13 PM, Antoine Girbal <antoine@10gen.com> wrote:
> >> >> > for roundup.js on my tiny laptop, it was still much faster (over 5x) for
> >> >> > 1 million docs, and memory usage did not shoot up.
> >> >> > Where it breaks down is maybe at 10M docs or more, but it may also
> >> >> > depend on some other factors.
> >> >> > Also, that's where we may see a big diff between V8 and SM, since the
> >> >> > object implementation matters a lot.
> >> >> > To save memory we can easily make the code reduce each key's values
> >> >> > when they get too big, but we would not dump to a temp collection.
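> >> >> > Something like this inside the JS emit (sketch only; the threshold
> >> >> > and names are made up):
> >> >> >
> >> >> >     // fold a key's values down with the user's reduce when they get big
> >> >> >     var REDUCE_TRIGGER = 100;   // arbitrary threshold for illustration
> >> >> >     var jsMap = {};
> >> >> >     function jsEmit(key, value, reduceFn) {
> >> >> >         var k = String(key);
> >> >> >         var vals = jsMap[k] = jsMap[k] || [];
> >> >> >         vals.push(value);
> >> >> >         if (vals.length >= REDUCE_TRIGGER) {
> >> >> >             jsMap[k] = [ reduceFn(k, vals) ];   // re-reduce in place
> >> >> >         }
> >> >> >     }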
> >> >> > I think if our goal is to make MR 5x faster or more so that people
> >> >> > consider it a viable option, we need to provide the "pure JS" execution
> >> >> > process.
> >> >> > We could make it an option within the MR options, like "fast" or
> >> >> > "pureJS".
> >> >> > I know we don't want new options, but then people could choose pureJS
> >> >> > for quick jobs and the standard mode for real heavy-duty stuff.
> >> >> > Also, in pureJS mode the emit key would have to be a value, not an
> >> >> > object.
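> >> >> > From the user side it could look like this (option name not final;
> >> >> > collection and functions made up for illustration):
> >> >> >
> >> >> >     // hypothetical shell usage of the proposed option
> >> >> >     db.events.mapReduce(mapFn, reduceFn, {
> >> >> >         out: { replace: "events_rollup" },
> >> >> >         pureJS: true   // keep intermediate data in JS; keys must be values
> >> >> >     });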
> >> >> >
> >> >> > Going to start committing the V8 stuff, but without that mode the
> >> >> > improvement will not be significant (about 20%).
> >> >> > AG
> >> >> >
> >> >> > On Wed, 2011-04-20 at 19:33 -0400, Eliot Horowitz wrote:
> >> >> >> Yes.
> >> >> >> For small things it worked well.
> >> >> >> The problem was that with large numbers of keys you can't fit it all
> >> >> >> in JS, which then makes it a lot more expensive.
> >> >> >> If we could determine how many keys there are...
> >> >> >>
> >> >> >> So we could start all in JS, and once you hit 1000 unique keys, you
> >> >> >> fall back to the current method...
> >> >> >> Maybe it's really 100k, not sure.
> >> >> >>
> >> >> >> On Wed, Apr 20, 2011 at 7:26 PM, Antoine Girbal <antoine@10gen.com> wrote:
> >> >> >> > So one way to get a massive improvement with map/reduce, about 5x,
> >> >> >> > is to keep the data in JS during the whole MR.
> >> >> >> > Right now we do:
> >> >> >> > - map converts all objects from BSON to JS
> >> >> >> > - emit is called, which converts objects from JS to BSON and inserts
> >> >> >> > them into a C++ map
> >> >> >> > - reduce is called, with objects converted from BSON back to JS
> >> >> >> > - then the result of reduce is converted from JS to BSON and output
> >> >> >> > Basically there is a lot of back and forth between JS and BSON, and
> >> >> >> > many objects are created/translated.
> >> >> >> >
> >> >> >> > I did an experiment with the following (rough code sketch after the
> >> >> >> > list):
> >> >> >> > - map converts all objects from BSON to JS
> >> >> >> > - emit just stores the object in a JS map
> >> >> >> > - reduce is called, and objects get reduced in JS
> >> >> >> > - then the result of reduce is converted from JS to BSON and output
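> >> >> >> >
> >> >> >> > In rough code the JS side is just (sketch only):
> >> >> >> >
> >> >> >> >     // sketch of the experiment: a plain JS object as the hashmap
> >> >> >> >     var jsMap = {};
> >> >> >> >     function emit(key, value) {          // works for string/number keys
> >> >> >> >         if (!jsMap.hasOwnProperty(key)) jsMap[key] = [];
> >> >> >> >         jsMap[key].push(value);
> >> >> >> >     }
> >> >> >> >     // after the map phase, reduce each key entirely in JS
> >> >> >> >     function reduceAll(reduceFn) {
> >> >> >> >         var results = {};
> >> >> >> >         for (var k in jsMap) {
> >> >> >> >             if (jsMap.hasOwnProperty(k)) {
> >> >> >> >                 results[k] = reduceFn(k, jsMap[k]);
> >> >> >> >             }
> >> >> >> >         }
> >> >> >> >         return results;   // only this final object goes back to BSON
> >> >> >> >     }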
> >> >> >> >
> >> >> >> > For mr1.js this brings execution from 1 sec down to 200 ms.
> >> >> >> > For roundup.js it brings the emit time from 40s down to 15s.
> >> >> >> > If I modify roundup to emit a string instead of an object for the
> >> >> >> > key, it gets down to 20s instead of 60s.
> >> >> >> >
> >> >> >> > There are problems though:
> >> >> >> > - it works if keys are strings/numbers, but doesn't work if the key
> >> >> >> > is an object. This would require a real JS hashmap and object
> >> >> >> > comparison (not native to JS).
> >> >> >> > - potentially more memory consumption at one point in time (but much
> >> >> >> > less churning of objects overall).
> >> >> >> >
> >> >> >> > did you guys look at this solution?
> >> >> >> > AG