[SERVER-2976] add a pure JS mode to map reduce to get improved performance for light jobs Created: 21/Apr/11  Updated: 12/Jul/16  Resolved: 27/Jun/11

Status: Closed
Project: Core Server
Component/s: MapReduce
Affects Version/s: None
Fix Version/s: 1.9.1

Type: Improvement Priority: Major - P3
Reporter: Antoine Girbal Assignee: Antoine Girbal
Resolution: Done Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

mongod will pick the mode as follows:

  • by default, use pure JS mode
  • if an emit key is an object, switch to C++ mode right away
  • if the hashmap grows beyond 1M keys, switch to C++ mode
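
The selection rules above could be sketched roughly as follows. This is only an illustration: `pickMode`, `MAX_JS_KEYS`, and the mode strings are hypothetical names, and the assumption that a job never returns to JS mode once it has switched is not stated in this ticket.

```javascript
// Hypothetical sketch of the mode heuristic; this is not mongod's actual code.
const MAX_JS_KEYS = 1000000; // the 1M-key threshold from the description

// Returns "js" or "cpp" given the current mode, the key about to be
// emitted, and the number of distinct keys currently held in the JS map.
function pickMode(currentMode, emitKey, distinctKeyCount) {
  // Assumption: once the job has left JS mode it does not switch back.
  if (currentMode === "cpp") return "cpp";

  // An object key cannot be used as a plain JS property name.
  const keyIsObject = emitKey !== null && typeof emitKey === "object";
  if (keyIsObject) return "cpp";

  // Too many distinct keys to keep comfortably in the JS heap.
  if (distinctKeyCount > MAX_JS_KEYS) return "cpp";

  return "js"; // default: stay in pure JS mode
}

console.log(pickMode("js", "4_1.2.3.4", 364434)); // js
console.log(pickMode("js", { month: 4 }, 10));    // cpp
```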

On Thu, Apr 21, 2011 at 2:41 PM, Antoine Girbal <antoine@10gen.com> wrote:
> > That seems reasonable.
> > But:
> > 1) it's kind of a pain to switch in the middle of the process from full js
> > to C++.
> > 2) there is still the limitation of key being a primitive value
> >
> > For #1:
> > We would have to decide when to break off from JS, dump the map into
> > C++, then use regular emits and process. This is kind of a pain code
> > wise. We could try to base it off the # of input keys as unreliable
> > ballpark to choose process at beginning, but that may not be always
> > known in case there is filter.
> >
> > For #2:
> > I think it's not a big limitation, since in most cases the _id of output
> > is a string or number. Even if someone wants to use an object as id,
> > they can make up a unique string from object and carry over any field
> > they need in the value (tried it, works fine and still fast). If we want
> > to support full objects as key in JS, we would need to add a hashmap
> > class and a object "equals" method. An alternative would be to serialize
> > object to JSON, use that as key, then deserialize later on in C++ (but
> > some field types cannot be serialized).
> >
> > I think it would be good at first to let user choose between modes
> > depending on use case, since it would solve #1 and #2.
> > Then once we fully understand the pros/cons of both methods we can add a
> > heuristic in code.
> > In any case it's good to have a flag to give a hint on mode to use in
> > case the heuristic is wrong.
> >
> > On Thu, 2011-04-21 at 14:20 -0400, Eliot Horowitz wrote:
> >> So maybe if keys are less than 1M we do full js.
> >>
> >> On Thu, Apr 21, 2011 at 1:20 PM, Antoine Girbal <antoine@10gen.com> wrote:
> >> > right in this case the # unique keys is 364434
> >> >
> >> > On Thu, 2011-04-21 at 12:15 -0400, Eliot Horowitz wrote:
> >> >> I meant # of keys.
> >> >> i.e. number of documents in output collection.
> >> >>
> >> >> the total # of docs doesn't matter.
> >> >>
> >> >> On Thu, Apr 21, 2011 at 12:13 PM, Antoine Girbal <antoine@10gen.com> wrote:
> >> >> > for roundup.js on my tiny laptop, it was still much faster (over 5x) for
> >> >> > 1 million docs, and memory usage did not shoot up.
> >> >> > Where it breaks down maybe 10m or more, but it may also depend on some
> >> >> > other factors.
> >> >> > Also that's where we may see some big diff between v8 and SM since the
> >> >> > object implementation matters a lot.
> >> >> > To save memory we can easily make code reduce each key when they get too
> >> >> > big, but we would not dump to temp collection.
> >> >> > I think if our goal is to make MR 5x faster or more so that people
> >> >> > consider it a viable option, we need to provide the "pure JS" execution
> >> >> > process.
> >> >> > We could make it an option within the MR options, like "fast" or
> >> >> > "pureJS".
> >> >> > I know we dont want new options but then ppl could choose pureJS for
> >> >> > quick jobs and the standard for real heavy duty stuff.
> >> >> > Also in pureJS mode the emit key would have to be a value not an object.
> >> >> >
> >> >> > going to start committing v8 stuff but without that mode the improvement
> >> >> > will not be significant (about 20% improvement).
> >> >> > AG
> >> >> >
> >> >> > On Wed, 2011-04-20 at 19:33 -0400, Eliot Horowitz wrote:
> >> >> >> Yes.
> >> >> >> For small things it worked well.
> >> >> >> The problem was with large amounts of keys you can't fit it all in js,
> >> >> >> which then makes it a lot more expensive.
> >> >> >> If we could determine how many keys there are...
> >> >> >>
> >> >> >> So we could start all in JS, and once you hit 1000 unique keys, you
> >> >> >> fall back to current method...
> >> >> >> Maybe it's really 100k, not sure
> >> >> >>
> >> >> >> On Wed, Apr 20, 2011 at 7:26 PM, Antoine Girbal <antoine@10gen.com> wrote:
> >> >> >> > so 1 way to get massive improvement with map/reduce, about 5x, is to
> >> >> >> > keep the data in js during the whole MR.
> >> >> >> > Right now we do:
> >> >> >> > - map converts all objects from bson to JS
> >> >> >> > - emit is called, converts objects from JS to bson, inserted in C++ map.
> >> >> >> > - reduce is called, objects converted from bson to JS
> >> >> >> > - then result of reduce converted from JS to bson and output
> >> >> >> > Basically there is a lot of back and forth JS/BSON, and many objects are
> >> >> >> > created / translated.
> >> >> >> >
> >> >> >> > Did experiment with following:
> >> >> >> > - map converts all objects from bson to JS
> >> >> >> > - emit just stores object in JS map
> >> >> >> > - reduce is called, objects get reduced in JS
> >> >> >> > - then result of reduce converted from JS to bson and output
> >> >> >> >
> >> >> >> > for mr1.js this brings execution from 1sec to 200ms.
> >> >> >> > for roundup.js it brings emit time from 40s to 15s.
> >> >> >> > If modify roundup to emit a string instead of object for key, gets it
> >> >> >> > down to 20s instead of 60s.
> >> >> >> >
> >> >> >> > There are problems though:
> >> >> >> > - works if keys are string/numbers, but doesnt work if key is object.
> >> >> >> > This would require a real js hashmap and object comparison (not native
> >> >> >> > to js).
> >> >> >> > - potentially more memory consumption at 1 point in time (but much less
> >> >> >> > churning of objects overall).
> >> >> >> >
> >> >> >> > did you guys look at this solution?
> >> >> >> > AG
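
The experiment described at the bottom of the thread (emit stores values in a JS map, reduce runs in JS, only the final result is converted) could look roughly like the sketch below. Everything here is hypothetical: `runInJs`, the `maxValuesPerKey` parameter, and passing `emit` to the map function as an argument are simplifications, not the server's implementation. The usage part also shows the workaround mentioned in the thread of building a string key instead of emitting an object.

```javascript
// Hypothetical sketch of the "keep everything in JS" flow; not mongod's code.
function runInJs(docs, mapFn, reduceFn, maxValuesPerKey) {
  // Plain JS object used as the key -> values map. Keys are coerced to
  // strings, so this only works for string/number keys.
  const table = Object.create(null);

  // emit() only appends to the JS map; nothing is converted to BSON here.
  function emit(key, value) {
    if (key !== null && typeof key === "object") {
      throw new Error("object keys are not supported in this sketch");
    }
    const values = table[key] || (table[key] = []);
    values.push(value);
    // Reduce early when one key accumulates too many values, to bound memory.
    if (values.length > maxValuesPerKey) {
      table[key] = [reduceFn(key, values)];
    }
  }

  // As in map/reduce, `this` is the current document inside the map function.
  for (const doc of docs) mapFn.call(doc, emit);

  // Final reduce; only these results would need converting for output.
  const out = [];
  for (const key in table) {
    const values = table[key];
    out.push({ _id: key, value: values.length > 1 ? reduceFn(key, values) : values[0] });
  }
  return out;
}

// Usage: a composite string key stands in for an object key.
const docs = [
  { ip: "1.2.3.4", month: 4 },
  { ip: "1.2.3.4", month: 4 },
  { ip: "5.6.7.8", month: 4 }
];
const result = runInJs(
  docs,
  function (emit) { emit(this.month + "_" + this.ip, 1); },
  function (key, values) { return values.reduce((a, b) => a + b, 0); },
  1000
);
console.log(JSON.stringify(result));
```

Because values may be reduced more than once, the reduce function has to accept its own output as input, which is already a requirement for map/reduce reduce functions.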



 Comments   
Comment by Eliot Horowitz (Inactive) [ 10/Jan/12 ]

Please open a new ticket if you are having trouble in a sharded environment.

Comment by Juhi Bhatia [ 10/Jan/12 ]

I get the same error in a sharded environment; the stack trace is:

com.mongodb.CommandResult$CommandFailure: command failed [command failed [mapreduce]

{ "serverUsed" : "localhost:27017" , "ok" : 0.0 , "errmsg" : "unknown m/r field for sharding: jsMode"}

at com.mongodb.CommandResult.getException(CommandResult.java:75)
at com.mongodb.CommandResult.throwOnError(CommandResult.java:121)
at com.mongodb.DBCollection.mapReduce(DBCollection.java:1055)

Is jsMode: true not supported in a sharded environment?

Comment by Nathan Ehresman [ 18/Oct/11 ]

Is there a reason that this isn't supported in a sharded environment? When I attempt it I get: "unknown m/r field for sharding: jsMode".

Comment by Antoine Girbal [ 27/Jun/11 ]

This mode is fully working, but I'm not sure it's a good idea to make it the default for now.
We may want to wait until we switch to v8, which has a higher default heap limit.
This mode can be used today with the flag "jsMode": true
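
As an illustration of passing the flag, a command document could be built as in the sketch below. The collection name, the emitted field, and the choice of inline output are assumptions made for the example; the guard around `db` is only there so the snippet does nothing when run outside the mongo shell.

```javascript
// Sketch of a mapReduce command that opts in to jsMode.
// Collection and field names are illustrative assumptions.
const cmd = {
  mapReduce: "downloads",
  // emit() and Array.sum() are provided by the server's JS environment.
  map: function () { emit(this.ip, 1); },
  reduce: function (key, values) { return Array.sum(values); },
  out: { inline: 1 },
  jsMode: true
};

// In the mongo shell `db` is defined and the command can be sent directly.
if (typeof db !== "undefined") {
  printjson(db.runCommand(cmd));
}
```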

Comment by auto [ 13/May/11 ]

Author: agirbal <antoine@10gen.com>

Message: SERVER-2976: added fallback to mixed mode in case an emit key is an object
Branch: master
https://github.com/mongodb/mongo/commit/51011c187b5c5b5fe50d7b6f09b7c12dae2c3de7

Comment by auto [ 13/May/11 ]

Author: agirbal <antoine@10gen.com>

Message: SERVER-2976: added fallback from js to mixed mode, added reduce steps within js mode
Branch: master
https://github.com/mongodb/mongo/commit/0f391def9476c65bf6e901e5e6838ce7cf39e3b1

Comment by auto [ 12/May/11 ]

Author: agirbal <antoine@10gen.com>

Message: SERVER-2976: fix segv in m/r if ns doesnt exist
Branch: master
https://github.com/mongodb/mongo/commit/b7509f5433974f36837a07bc8f8f8b3e5893071b

Comment by auto [ 12/May/11 ]

Author: agirbal <antoine@10gen.com>

Message: SERVER-2976: jsMode now works with inline output
Branch: master
https://github.com/mongodb/mongo/commit/fddb4b44db84495c839bb5c0358191f705926308

Comment by auto [ 12/May/11 ]

Author: agirbal <antoine@10gen.com>

Message: SERVER-2976: cleaner JS
Branch: master
https://github.com/mongodb/mongo/commit/dad6b228f76b620ccffbb6994226348e263a5281

Comment by auto [ 12/May/11 ]

Author: agirbal <antoine@10gen.com>

Message: SERVER-2976: slightly better js function. Cleanup of js objects.
Branch: master
https://github.com/mongodb/mongo/commit/860369c353452e0b135d7944a84674862136d59a

Comment by auto [ 12/May/11 ]

Author: agirbal <antoine@10gen.com>

Message: SERVER-2976: fixed the emit count in js mode
Branch: master
https://github.com/mongodb/mongo/commit/acb028bdcd34d2908d7f354dd98932213abb6ad1

Comment by Antoine Girbal [ 11/May/11 ]

Some results from the current code as of d47de50498be988f3a8b139214a63e80d4d9fac3:

For the M/R of roundup.js (mongo downloads) we get about a 2.5x speedup:
> db.runCommand({
    mapReduce: "downloads",
    map: function() { emit( getMonth(this) + "_" + this.ip , 1 ); },
    reduce: function(k, values) { return Array.sum( values); },
    out: "myoutnew", verbose: true, jsMode: true })

SM in mixed mode:
{
    "result" : "myoutnew",
    "timeMillis" : 72510,
    "timing" : { "mapTime" : NumberLong(50614), "emitLoop" : 65134, "total" : 72510 },
    "counts" : { "input" : 1036354, "emit" : 1036354, "output" : 364434 },
    "ok" : 1
}

v8 in mixed mode:
{
    "result" : "myoutnew",
    "timeMillis" : 53050,
    "timing" : { "mapTime" : NumberLong(28193), "emitLoop" : 42201, "total" : 53050 },
    "counts" : { "input" : 1036354, "emit" : 1036354, "output" : 364434 },
    "ok" : 1
}

v8 in JS mode:
{
    "result" : "myoutnew",
    "timeMillis" : 30985,
    "timing" : { "mapTime" : NumberLong(20077), "emitLoop" : 21112, "total" : 30985 },
    "counts" : { "input" : 1036354, "emit" : 0, "output" : 364434 },
    "ok" : 1
}
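
For reference, the three roundup.js timings above can be turned into speedup ratios with a few lines of arithmetic. The `timeMillis` values are copied from the results; the `speedup` helper and the variable names are just illustrative.

```javascript
// timeMillis values copied from the three roundup.js results above.
const roundup = { smMixed: 72510, v8Mixed: 53050, v8Js: 30985 };

// Speedup expressed as the ratio of the slower elapsed time to the faster one.
function speedup(slowMs, fastMs) {
  return slowMs / fastMs;
}

console.log("SM mixed -> v8 JS mode:", speedup(roundup.smMixed, roundup.v8Js).toFixed(2)); // 2.34
console.log("v8 mixed -> v8 JS mode:", speedup(roundup.v8Mixed, roundup.v8Js).toFixed(2)); // 1.71
console.log("SM mixed -> v8 mixed:  ", speedup(roundup.smMixed, roundup.v8Mixed).toFixed(2)); // 1.37
```

Read this way, the overall gain combines the engine change (SM to v8) with the mode change (mixed to JS).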

Test running jstests/mr1.js..
v8 in mixed mode:
"timeMillis" : 1573,
"timeMillis" : 1073,

v8 in pure JS mode:
"timeMillis" : 470,
"timeMillis" : 447,

Comment by auto [ 10/May/11 ]

Author: agirbal <antoine@10gen.com>

Message: SERVER-2579: added readonly object for lazy v8 objects, which makes them much faster to access
SERVER-2976: further implementation of jsMode for M/R. Can be turned on using temp flag jsMode:true
Branch: master
https://github.com/mongodb/mongo/commit/13e71279c964fc3f84e682901f37152935557f70

Comment by auto [ 10/May/11 ]

Author: agirbal <antoine@10gen.com>

Message: SERVER-2976: added M/R pure jsMode with collection output
Branch: master
https://github.com/mongodb/mongo/commit/8d13203c40150d93b5747adf4430a8b6edd21ac9
