[SERVER-699] Support other scripting languages (eg perl) for map/reduce Created: 04/Mar/10  Updated: 06/Dec/22

Status: Open
Project: Core Server
Component/s: Usability
Affects Version/s: None
Fix Version/s: features we're not sure of

Type: New Feature Priority: Major - P3
Reporter: josh rabinowitz Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 21
Labels: map-reduce
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Query Optimization
Participants:

 Description   

It would be advantageous to be able to use other scripting languages in map/reduce tasks (for me, perl, though I could see python being a good fit too).

This would allow developers to write map/reduce tasks more easily, and give them access to code and libraries in that language which might be useful during those tasks.



 Comments   
Comment by josh rabinowitz [ 02/Dec/10 ]

I'm the original poster of this JIRA (not that I was the first to want support for other languages in m/r). It's been interesting to see how the conversation here has evolved.

To add my $0.01: +1 to streaming solution. And BSON in/out sounds just fine.

Comment by Bobby J [ 01/Dec/10 ]

Big priority for us. We chose to use mongodb partly because pymongo integrated so nicely into our python codebase. Now we find ourselves using hadoop for mapreduce jobs just so we can keep our mapper/reducer functionality in python. Thanks for looking into this!

Comment by Paul Harvey [ 16/Oct/10 ]

I can appreciate that this task may be a little open-ended; there are some interesting design decisions to make. Turning mongo into a full-blown distributed HPC platform might be asking too much. But we would really appreciate a streaming solution also - no matter how primitive.

Although we will be storing raw data in mongodb, the system we are building can only exploit mongo for metadata (management of the raw data). As things currently stand, we either have to fund someone to re-work a precious few algorithms into mongo+m/r javascript (costly, unsustainable), relying on sharding to have any hope of reasonable CPU utilisation, or we build an in-house API to bridge the raw data from mongodb to an entirely separate distributed HPC framework.

We work in bioinformatics - many problems fit embarrassingly well into map/reduce, but we rely heavily on libraries to do the bulk of the work (python, perl, ruby, probably in that order, though people use things like R on their workstations).

Comment by Valery Khamenya [ 22/Sep/10 ]

+1 to Mathieu Poumeyrol

Comment by Eliot Horowitz (Inactive) [ 22/Jun/10 ]

@mathieu @cyril we agree. We haven't gotten to it yet, but it's definitely one of the things we want to support.
The first version will probably require you to manage binaries, and the API will be BSON in and out.

Comment by Cyril Mougel [ 22/Jun/10 ]

I totally agree with Mathieu. The streaming solution is a really good way to do what we want with map/reduce. In a way, it can also give us multi-threaded map/reduce, because it's our program that would be multi-threaded, not MongoDB.

Comment by Mathieu Poumeyrol [ 22/Jun/10 ]

I had a conversation at MongoFR with Matthias, and I think it would be a good place to followup.

I think we need something similar to hadoop streaming. The principle is simple: each mapper starts an external process with a command specified by the user, pushes each document to the process's STDIN, and reads each emitted value from its STDOUT. And the same for the reducer.

That would give m/r support for any language that can read and write json and/or bson, instead of having to pick one or a few languages, which would leave most users frustrated and require more and more heavy code maintenance and complex dependencies.

Another nice point of this approach is that it is very easy to simulate map/reduce using unix pipes in a development environment.

Matthias expressed concern about this feature allowing arbitrary code execution on the server, but that is a risk that can be mitigated: we could limit it to a directory where the admin puts the scripts, or even to a more restricted list, or run a map/reduce worker as nobody... though that would make the installation seriously more difficult.

As far as I'm concerned, I'd prefer the server to let me do whatever I want, with the user mongo is actually running. My data, my responsibility.

For your information, hadoop also manages code transport from wherever the job is launched to the various nodes.

The use case I'm investigating is log analysis: I would love to get all my logs into mongo to support real-time collection, long-term storage, massive analysis and pinpoint debugging. But for massive analysis, streaming is an absolute must.

http://wiki.apache.org/hadoop/HadoopStreaming

Comment by Evan Wies [ 15/Apr/10 ]

Some notes on Lua. Lua is really fast and designed to be embedded.

mstearn said that preserving the ordering of keys is important. Lua doesn't do that. There was a patch in January 2010 that does (http://lua-users.org/lists/lua-l/2010-01/msg00199.html). Since it is a patch rather than a library, it would only be feasible for the server; clients can't be expected to use a patched VM.
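The key-ordering concern applies to clients in other languages too. As an illustration, in Python the standard `json` module's `object_pairs_hook` parameter guarantees that keys come back in document order regardless of the interpreter's dict behaviour (the sample document here is made up):

```python
import json
from collections import OrderedDict

# Parse JSON while preserving the order in which keys appear in the
# document -- relevant wherever BSON-style key ordering matters.
doc = json.loads('{"b": 1, "a": 2}', object_pairs_hook=OrderedDict)

print(list(doc.keys()))  # → ['b', 'a']
```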

Lua has an awesome JIT (http://www.luajit.org). It would make a lot of sense for map/reduce. You'd need to port the ordered patch to it. You'd also want to wait for the Foreign Function Interface (FFI) or raw struct access to be added.

Notes on syntax: Lua table syntax is a little cleaner than pure JSON.
pure JSON:

{ "name" : "mongo", "type" : "db" }

Lua:

{ name = "mongo", type = "db" }

Since Lua's native (albeit configurable) numeric type is a double, there are issues with storing 64-bit integers. There are various web-searchable solutions for this.
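The double-precision limitation is easy to demonstrate in any language whose default number type is an IEEE-754 double (Python floats are doubles too, so this sketch stands in for Lua's behaviour):

```python
# IEEE-754 doubles have a 53-bit significand, so not every integer
# above 2**53 is representable exactly -- the core problem with
# storing 64-bit integers in a double-based number type like Lua's.
exact_limit = 2 ** 53

assert float(exact_limit) == exact_limit             # still exact
assert float(exact_limit + 1) == float(exact_limit)  # 2**53 + 1 collapses
```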

Comment by Ben Poweski [ 05/Apr/10 ]

For us, perl, with all of its complexity and, well, Perlisms, would be a less desirable language. I think Lua would be a natural fit.

Comment by Eliot Horowitz (Inactive) [ 04/Mar/10 ]

agree it would be nice - but non-trivial
we would either have to embed each one, or provide a general process-IO version of map/reduce
unclear how well that would work, but perhaps

Generated at Thu Feb 08 02:54:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.